
MLOps, ML Platform Engineer
SumerSports
full-time
Location Type: Remote
Location: United States
About the role
- Design and operate ML infrastructure: Manage data, training, serving, and inference systems for high-throughput model workflows
- Build scalable pipelines: Implement reproducible training and evaluation pipelines with versioning, scheduling, and artifact tracking
- Optimize compute and cost: Tune GPU and CPU workloads, manage clusters, and drive efficiency via rightsizing, spot scheduling, and caching
- Serve models in production: Operate APIs for low-latency inference with autoscaling, blue-green or canary rollouts, and rollback safety
- Ensure reliability and observability: Define and own SLOs; instrument pipelines and services to track latency, cost, drift, and data quality
- Secure and automate: Manage IAM, secrets, and container security; automate deployment pipelines via CI/CD and infrastructure as code
- Collaborate cross-functionally: Partner with research scientists and AI engineers to deliver models from experiment to production with minimal friction
- Document and enable: Build templates, runbooks, and internal tooling that make ML workflows repeatable, safe, and fast
Requirements
- 4+ years of experience in ML platform, DevOps, or infrastructure engineering
- Deep knowledge of Kubernetes, CI/CD, containers, and cloud infrastructure (AWS, GCP, or Azure)
- Hands-on experience managing GPU clusters and training/inference pipelines
- Familiarity with data orchestration, storage formats, and processing engines (Delta, Parquet, Polars, Spark)
- Proven ability to ship and operate production ML systems with SLOs
- Strong Python skills and comfort with infrastructure as code and automation
- Experience with observability and cost optimization at scale
Benefits
- Competitive salary and bonus plan
- Comprehensive health insurance plan
- Retirement savings plan (401k) with company match
- Remote working environment
- A flexible, unlimited time off policy
- Generous paid holiday schedule: 13 days in total, including the Monday after the Super Bowl
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
machine learning infrastructure, data management, training systems, inference systems, pipeline implementation, GPU management, cost optimization, Python, infrastructure as code, observability
Soft Skills
collaboration, documentation, problem-solving, communication