Staff Machine Learning Engineer, AI Serving
Reddit, Inc.
Staff Machine Learning Engineer developing a large-scale ML inference platform at Reddit. Leads the design and maintenance of a GPU-based model serving system while collaborating across teams.
Tech Stack
Tools & technologies: AWS, Cloud, Go, Kubernetes, Python, PyTorch, Terraform
About the role
Key responsibilities & impact
- Lead the end-to-end design, implementation, and maintenance of a highly available, low-latency GPU-based model serving system for search, ranking, and LLMs, supporting millions of QPS.
- Design and develop ML and Generative AI systems in cloud-based production environments on Kubernetes at scale.
- Rapidly prototype and build a high-performance feature hydration and processing system as part of the inference stack, including routing, caching, and batching.
- Lead a unified GPU model export framework to support converting trained models into optimized GPU inference models.
- Strong understanding of real-time ML observability to track feature/model performance.
- Experience working with LLM serving online at scale.
- Build an end-to-end (E2E) inference performance benchmarking framework.
- Deep understanding of multi-cluster compute environments and the network topology specific to ML inference use cases.
Requirements
What you’ll need
- 7+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles.
- Experience operating orchestration systems such as Kubernetes at scale
- Deep experience with cloud-based technologies for supporting an ML platform, including tools like AWS, Google Cloud Storage, infrastructure-as-code (Terraform), and more
- Proficiency with common ML programming languages and frameworks, such as Go and Python
- Excellent communication skills with the ability to articulate technical AI concepts to non-technical stakeholders
- Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the genAI product development lifecycle.
- Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems is a plus
- Strong proficiency in Python and deep experience with modern AI/ML frameworks (Triton, Dynamo, vLLM, PyTorch)
Benefits
Comp & perks
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
ML Engineering, AI Platform Engineering, Cloud AI Deployment, Kubernetes, AWS, Google Cloud Storage, Terraform, Python, Go, Triton
Soft Skills
communication skills, articulate technical concepts, focus on scalability, focus on reliability, focus on performance, advocate for platform users, intuition for product development lifecycle