
Staff/Principal ML Ops Engineer
Pragmatike
full-time
Location Type: Hybrid
Location: Cambridge • Massachusetts • United States
About the role
- Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
- Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
- Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows).
- Lead the implementation of observability for ML systems, monitoring drift, performance, throughput, reliability, and cost.
- Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models.
- Collaborate with researchers to productionize models and accelerate training/inference pipelines.
- Establish ML Ops best practices, internal standards, and cross-team tooling.
- Mentor engineers and influence architectural direction across the entire AI platform.
Requirements
- Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected).
- Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
- Proficiency with Python and familiarity with TypeScript or Go for platform integration.
- Expertise in ML frameworks and tooling (PyTorch, Transformers, vLLM, Llama-factory, Megatron-LM), with a practical understanding of CUDA / GPU acceleration.
- Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
- Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries.
- Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments.
Benefits
- Competitive salary & equity options
- Sign-on bonus
- Health, Dental, and Vision
- 401k
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
ML Ops, distributed systems, cloud infrastructure, Python, TypeScript, Go, PyTorch, Transformers, CUDA, ML lifecycle workflows
Soft Skills
leadership, collaboration, mentoring, technical strategy, cross-functional collaboration, fast-paced environment