Lead Machine Learning Engineer, LLM Infrastructure

Salesforce

Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.

Posted 4/28/2026 • Full-time • San Francisco • California • 🇺🇸 United States • Senior • 💰 $172,500 - $260,100 per year

Tech Stack

Tools & technologies

AWS, Cloud, Docker, Google Cloud Platform, Kubernetes, Python

About the role

Key responsibilities & impact

Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
Build reliable systems for feedback-driven model improvement, including human or AI feedback loops, large-scale offline evaluation, and regression detection.
Partner closely with research scientists to turn new post-training methods into reusable engineering workflows.
Collaborate with agent engineers and platform teams to integrate training and evaluation systems with production model and agent stacks.
Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Requirements

What you’ll need

5+ years of experience in software engineering, ML systems, or distributed infrastructure.
Strong proficiency in Python and experience building production systems or large-scale ML pipelines.
Hands-on experience building infrastructure for model training, post-training, evaluation, or serving.
Experience designing reliable, scalable systems for distributed and GPU-based workloads.
Strong debugging skills across systems, pipelines, and model-facing failures.
Experience building infrastructure for LLM post-training, including RLHF, preference optimization, reward modeling, or related feedback-driven training workflows.
Experience working cross-functionally with research scientists and engineers.
Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).

Benefits

Comp & perks

Health insurance
401(k) matching
Flexible work hours
Paid time off
Remote work options
Professional development opportunities
Employee stock purchase program
Mental health support
Life and disability insurance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, distributed infrastructure, ML systems, large-scale ML pipelines, reliable systems design, debugging, LLM post-training, RLHF, preference optimization, reward modeling

Soft Skills

collaboration, cross-functional teamwork, problem-solving