SumerSports

MLOps, ML Platform Engineer

full-time

Location Type: Remote

Location: United States

About the role

  • Design and operate ML infrastructure: Manage data, training, serving, and inference systems for high-throughput model workflows
  • Build scalable pipelines: Implement reproducible training and evaluation pipelines with versioning, scheduling, and artifact tracking
  • Optimize compute and cost: Tune GPU and CPU workloads, manage clusters, and drive efficiency via rightsizing, spot scheduling, and caching
  • Serve models in production: Operate APIs for low-latency inference with autoscaling, blue-green or canary rollouts, and rollback safety
  • Ensure reliability and observability: Define and own SLOs; instrument pipelines and services to track latency, cost, drift, and data quality
  • Secure and automate: Manage IAM, secrets, and container security; automate deployment pipelines via CI/CD and infrastructure as code
  • Collaborate cross-functionally: Partner with research scientists and AI engineers to deliver models from experiment to production with minimal friction
  • Document and enable: Build templates, runbooks, and internal tooling that make ML workflows repeatable, safe, and fast

Requirements

  • 4+ years of experience in ML platform, DevOps, or infrastructure engineering
  • Deep knowledge of Kubernetes, CI/CD, containers, and cloud infrastructure (AWS, GCP, or Azure)
  • Hands-on experience managing GPU clusters and training/inference pipelines
  • Familiarity with data orchestration, storage formats, and processing engines (Delta, Parquet, Polars, Spark)
  • Proven ability to ship and operate production ML systems with SLOs
  • Strong Python skills and comfort with infrastructure as code and automation
  • Experience with observability and cost optimization at scale

Benefits

  • Competitive salary and bonus plan
  • Comprehensive health insurance plan
  • Retirement savings plan (401k) with company match
  • Remote working environment
  • A flexible, unlimited time off policy
  • Generous paid holiday schedule: 13 holidays in total, including the Monday after the Super Bowl

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
machine learning infrastructure, data management, training systems, inference systems, pipeline implementation, GPU management, cost optimization, Python, infrastructure as code, observability
Soft Skills
collaboration, documentation, problem-solving, communication