Machine Learning Operations Engineer

Modulate

ML Operations Engineer responsible for the reliability and efficiency of production systems at Modulate. The role involves scaling machine learning models and collaborating with engineering teams.

Posted 5/13/2026
full-time
Somerville • Massachusetts • 🇺🇸 United States
Mid-Level • Senior
💰 $150,000 - $200,000 per year

Tech Stack

Tools & technologies

AWS
Linux
Python
PyTorch
Terraform

About the role

Key responsibilities & impact

Own the reliability and performance of ML model inference systems in production
Ensure high availability of deployed models across APIs and enterprise products
Build systems to handle scaling, load variability, and production traffic growth
Reduce operational burden through better tooling, automation, and processes
Help define how Modulate runs ML systems at scale with reliability and efficiency
Deploy, monitor, and maintain production machine learning inference systems
Oversee fleets of inference machines and ensure system health and performance
Design monitoring, alerting, and incident response systems for ML workloads
Participate in on-call rotations and lead incident response and debugging
Build systems and processes for scaling inference infrastructure under variable load
Improve reliability and observability of production ML services
Collaborate on infrastructure-as-code for production deployments
Support or contribute to GPU-based training and inference infrastructure
Work closely with ML and engineering teams to ensure smooth model deployments
(Optional growth area) Optimize model inference performance and latency

Requirements

What you’ll need

Experience deploying and maintaining production software systems
Experience building monitoring and alerting systems for production environments
Experience with on-call rotations and incident response
Strong experience with AWS, Python, and Linux
Exposure to PyTorch or similar ML frameworks
Experience working with GPU-based applications and basic GPU tooling (drivers, runtime, monitoring)
Strong debugging and systems thinking skills
Ability to operate calmly in production incident environments

Nice to Have

Experience with ML model serving systems or dedicated model servers
Experience monitoring GPU performance for inference workloads
Experience optimizing machine learning model inference
Familiarity with audio or multimedia data (codecs, streaming, real-time systems)
Experience with infrastructure-as-code (e.g., Terraform, CloudFormation)

Benefits

Comp & perks

Competitive salary + equity
Full health, dental, and vision coverage
Flexible PTO with a strong culture of taking it
Weekly team lunches with dietary accommodations
Hybrid work with core in-office days and flexible remote options
Leadership and technical learning sessions
Career development and continued learning support
Up to 8 weeks of work-from-anywhere
A deeply inclusive, human-centered culture

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

machine learning
model inference
AWS
Python
Linux
PyTorch
GPU-based applications
infrastructure-as-code
monitoring systems
incident response

Soft Skills

debugging
systems thinking
calmness in production incidents
collaboration