Salesforce

Lead Machine Learning Engineer, LLM Infrastructure

Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.

Posted 4/28/2026 • Full-time • San Francisco, California, United States • Senior • $172,500 - $260,100 per year

Tech Stack

Tools & technologies
AWS, Cloud, Docker, Google Cloud Platform, Kubernetes, Python

About the role

Key responsibilities & impact
  • Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
  • Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
  • Build reliable systems for feedback-driven model improvement, including human or AI feedback loops, large-scale offline evaluation, and regression detection.
  • Partner closely with research scientists to turn new post-training methods into reusable engineering workflows.
  • Collaborate with agent engineers and platform teams to integrate training and evaluation systems with production model and agent stacks.
  • Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
  • Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Requirements

What you’ll need
  • 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
  • Strong proficiency in Python and experience building production systems or large-scale ML pipelines.
  • Hands-on experience building infrastructure for model training, post-training, evaluation, or serving.
  • Experience designing reliable, scalable systems for distributed and GPU-based workloads.
  • Strong debugging skills across systems, pipelines, and model-facing failures.
  • Experience building infrastructure for LLM post-training, including RLHF, preference optimization, reward modeling, or related feedback-driven training workflows.
  • Experience working cross-functionally with research scientists and engineers.
  • Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).

Benefits

Comp & perks
  • Health insurance
  • 401(k) matching
  • Flexible work hours
  • Paid time off
  • Remote work options
  • Professional development opportunities
  • Employee stock purchase program
  • Mental health support
  • Life and disability insurance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, distributed infrastructure, ML systems, large-scale ML pipelines, reliable systems design, debugging, LLM post-training, RLHF, preference optimization, reward modeling
Soft Skills
collaboration, cross-functional teamwork, problem-solving