Tech Stack
Tools & technologies
- Flash
About the role
Key responsibilities & impact
- Design and deploy state-of-the-art model serving architectures that deliver high throughput and low latency while optimizing memory usage.
- Ensure these pipelines run efficiently across diverse environments, including resource-constrained devices and edge platforms.
- Establish clear performance targets such as reduced latency, faster token response, and a smaller memory footprint.
- Build, run, and monitor controlled inference tests in both simulated and live production environments.
- Track key performance indicators such as response latency, throughput, memory consumption, and error rates, with special attention to metrics specific to resource-constrained devices.
- Document iterative results and compare outcomes against established benchmarks to validate performance across platforms.
- Identify and prepare high-quality test datasets and simulation scenarios tailored to real-world deployment challenges, specifically those encountered on low-resource devices.
- Set measurable criteria to ensure that these resources effectively evaluate model performance, latency, and memory utilization under various operational conditions.
- Analyze computational efficiency and diagnose bottlenecks in the serving pipeline by monitoring both processing and memory metrics.
- Address issues such as suboptimal batch processing, network delays, and high memory usage to optimize the serving infrastructure for scalability and reliability on resource-constrained systems.
- Work closely with cross-functional teams to integrate optimized serving and inference frameworks into production pipelines designed for edge and on-device applications.
- Define clear success metrics, such as improved real-world performance, low error rates, robust scalability, and optimal memory usage, and ensure continuous monitoring and iterative refinement for sustained improvement.
Requirements
What you’ll need
- A degree in Computer Science or a related field.
- Ideally, a PhD in NLP, Machine Learning, or a related field, complemented by a solid track record in AI R&D (with strong publications at A* conferences).
- Must have knowledge of Metal Shading Language (MSL).
- Proven experience in low-level kernel optimizations and inference optimization on mobile devices is essential.
- Your contributions should have led to measurable improvements in inference latency, throughput, and memory footprint for domain-specific applications, particularly on resource-constrained devices and edge platforms.
- A deep understanding of modern model serving architectures and inference optimization techniques is required.
- Must have strong expertise in writing GPU kernels for mobile devices (i.e., smartphones) as well as a deep understanding of model serving frameworks and engines.
- Practical experience in developing and deploying end-to-end inference pipelines, from optimizing models for efficient serving to integrating these solutions on resource-constrained devices, is required.
- Demonstrated ability to apply empirical research to overcome challenges in model serving, such as latency optimization, computational bottlenecks, and memory constraints.
- You should be proficient in designing robust evaluation frameworks and iterating on optimization strategies to continuously push the boundaries of inference performance and system efficiency.
- Distributed Inference Systems: Designing and optimizing high-performance inference engines using techniques like Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism to handle massive models on GPU clusters.
- Deep understanding of the math and structure behind Diffusion Models and Vision Transformers.
- Understanding of Pruning, Quantization, Flash Attention, KV Cache, Speculative Decoding (EAGLE), etc.
Benefits
Comp & perks
- Health insurance
- 401(k) matching
- Flexible work hours
- Paid time off
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- Metal Shading Language (MSL)
- GPU kernels
- inference optimization
- model serving architectures
- end-to-end inference pipelines
- latency optimization
- computational bottlenecks
- Pruning
- Quantization
- Diffusion Models
Soft Skills
- cross-functional collaboration
- analytical skills
- problem-solving
- documentation
- iterative refinement
- performance analysis
- communication
- research application
- evaluation framework design
- scalability focus
Certifications
- PhD in NLP
- PhD in Machine Learning
- degree in Computer Science
