About the role
Key responsibilities & impact
- Drive innovation in model serving and inference architectures for advanced AI systems
- Focus on optimizing model deployment and inference strategies to deliver highly responsive, efficient, and scalable performance
- Work on a wide spectrum of systems, ranging from resource-efficient models designed for limited hardware environments to complex, multi-modal architectures
- Engineer robust inference pipelines, establish comprehensive performance metrics, and identify and resolve bottlenecks
- Enable high-throughput, low-latency, memory-efficient, and scalable AI performance that delivers tangible value in dynamic, real-world scenarios
- Design and deploy state-of-the-art model serving architectures that deliver high throughput and low latency while optimizing memory usage
- Build, run, and monitor controlled inference tests in both simulated and live production environments
- Track key performance indicators such as response latency, throughput, memory consumption, and error rates
- Document iterative results and compare outcomes against established benchmarks
- Identify and prepare high-quality test datasets and simulation scenarios tailored to real-world deployment challenges
Requirements
What you’ll need
- A degree in Computer Science or a related field
- Ideally a PhD in NLP, Machine Learning, or a related field, complemented by a solid track record in AI R&D (with strong publications at A* conferences)
- Must have knowledge of Metal Shading Language (MSL)
- Proven experience in low-level kernel optimizations and inference optimization on mobile devices is essential
- Deep understanding of modern model serving architectures and inference optimization techniques is required
- Strong expertise in writing GPU kernels for mobile devices (e.g., smartphones) as well as a deep understanding of model serving frameworks and engines
- Practical experience in developing and deploying end-to-end inference pipelines
- Demonstrated ability to apply empirical research to overcome challenges in model serving
- Experience with distributed inference systems: designing and optimizing high-performance inference engines using techniques like Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism to handle massive models on GPU clusters
- Deep understanding of the math and structure behind Diffusion Models and Vision Transformers
Benefits
Comp & perks
- Health insurance
- Work from anywhere
- Professional development opportunities
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
model serving architectures, inference optimization, Metal Shading Language (MSL), GPU kernels, end-to-end inference pipelines, Tensor Parallelism, Pipeline Parallelism, Expert Parallelism, Diffusion Models, Vision Transformers
Soft Skills
innovation, problem-solving, empirical research application, documentation, performance metrics analysis
Certifications
PhD in NLP, PhD in Machine Learning
