Pragmatike

Staff/Principal ML Ops Engineer

full-time

Location Type: Hybrid

Location: Cambridge, Massachusetts, United States

About the role

  • Architect, build, and scale the end-to-end ML Ops pipeline, including training, fine-tuning, evaluation, rollout, and monitoring.
  • Design reliable infrastructure for model deployment, versioning, reproducibility, and orchestration across cloud and on-prem GPU clusters.
  • Optimize compute usage across distributed systems (Kubernetes, autoscaling, caching, GPU allocation, checkpointing workflows).
  • Lead the implementation of observability for ML systems (monitor drift, performance, throughput, reliability, cost).
  • Build automated workflows for dataset curation, labeling, feature pipelines, evaluation, and CI/CD for ML models.
  • Collaborate with researchers to productionize models and accelerate training/inference pipelines.
  • Establish ML Ops best practices, internal standards, and cross-team tooling.
  • Mentor engineers and influence architectural direction across the entire AI platform.

Requirements

  • Deep hands-on experience designing and operating production ML systems at scale (Staff/Principal-level expected).
  • Strong background in ML Ops, distributed systems, and cloud infrastructure (AWS, GCP, or Azure).
  • Proficiency with Python and familiarity with TypeScript or Go for platform integration.
  • Expertise in ML frameworks and tooling (PyTorch, Transformers, vLLM, LLaMA-Factory, Megatron-LM), with a practical understanding of CUDA / GPU acceleration.
  • Strong experience with containerization and orchestration (Docker, Kubernetes, Helm, autoscaling).
  • Deep understanding of ML lifecycle workflows: training, fine-tuning, evaluation, inference, model registries.
  • Ability to lead technical strategy, collaborate cross-functionally, and operate in fast-paced environments.

Benefits

  • Competitive salary & equity options
  • Sign-on bonus
  • Health, Dental, and Vision
  • 401(k)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
ML Ops, distributed systems, cloud infrastructure, Python, TypeScript, Go, PyTorch, Transformers, CUDA, ML lifecycle workflows
Soft Skills
leadership, collaboration, mentoring, technical strategy, cross-functional collaboration, fast-paced environment