Sumo Logic

Senior Machine Learning Engineer – MLOps, LLMOps

full-time

Location Type: Remote

Location: California, United States

About the role

  • Design and implement scalable MLOps/LLMOps platforms supporting the full ML lifecycle: data versioning, model training, evaluation, deployment, and monitoring
  • Build and maintain CI/CD pipelines for ML models and LLM applications with automated testing, validation, and rollback capabilities
  • Develop infrastructure-as-code (IaC) for reproducible, version-controlled ML environments
  • Architect model serving infrastructure with auto-scaling, A/B testing, and canary deployment capabilities
  • Build platforms for LLM fine-tuning, prompt management, and experimentation at scale
  • Implement evaluation frameworks for LLM performance, quality, safety, and cost optimization
  • Design and deploy enterprise-grade AI agents and copilots with robust monitoring and guardrails
  • Establish LLM observability: token usage tracking, latency monitoring, prompt/response logging, and cost attribution
  • Own uptime, reliability, and performance of ML/LLM services (SLIs/SLOs)
  • Implement comprehensive monitoring, alerting, and incident response for ML systems
  • Participate in on-call rotations and drive post-incident reviews to improve system resilience
  • Build automation and tooling to reduce toil and accelerate ML development velocity
  • Partner with ML Engineers and Data Scientists to translate research into production-ready systems
  • Collaborate with platform and infrastructure teams on cloud architecture and resource optimization
  • Mentor team members on MLOps best practices, production ML patterns, and operational excellence
  • Drive technical decisions on tooling, frameworks, and architectural patterns

Requirements

  • Education: B.S./M.S./Ph.D. in Computer Science, Engineering, or related technical field
  • Experience: 4+ years of software engineering experience with 2+ years focused on MLOps/LLMOps
  • MLOps Expertise:
      • Production experience with ML model serving frameworks (e.g., TensorFlow Serving, TorchServe, Triton)
      • Hands-on experience with ML experiment tracking and model registry tools (MLflow, Weights & Biases, Kubeflow)
      • Proficiency in workflow orchestration (Airflow, Prefect, Kubeflow Pipelines, Metaflow)
  • LLMOps Expertise:
      • Experience with LLM deployment, fine-tuning, and evaluation frameworks (e.g., vLLM, LangChain, LlamaIndex)
      • Knowledge of prompt engineering, RAG architectures, and LLM application patterns
      • Familiarity with LLM observability tools (e.g., LangSmith, Arize, WhyLabs)
  • Cloud & Infrastructure:
      • Strong experience with major cloud providers (AWS, GCP, or Azure) and ML-specific services (SageMaker, Vertex AI, Azure ML, Bedrock)
      • Proficiency in containerization (Docker, Kubernetes) and infrastructure-as-code (Terraform, CloudFormation, Pulumi)
      • Experience with microservices architecture and API development (REST, gRPC)
  • Software Engineering:
      • Strong programming skills in Python, Terraform, and Helm; familiarity with Go, Java, or Rust is a plus
      • Deep understanding of CI/CD practices and tools (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
      • Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK)
  • Operational Excellence:
      • Track record of managing production systems with defined SLIs/SLOs
      • Experience with on-call rotations, incident management, and reliability engineering practices

Benefits

  • Compensation varies based on factors that include (but aren’t limited to) role level, skills and competencies, qualifications, knowledge, location, and experience.
  • In addition to base pay, certain roles are eligible for our bonus or commission plans, benefits offerings, and equity awards.