Principal MLOps Engineer

Raft

Principal MLOps Engineer at Raft, supporting customers with a focus on AI and data platforms for the Department of Defense. The role designs and maintains ML infrastructure and deployment pipelines.

Posted 4/20/2026 • Full-time • Remote • Colorado, Florida, Hawaii, Massachusetts, Texas, Virginia • 🇺🇸 United States • Lead • 💰 $150,000 - $200,000 per year

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Docker, Kubernetes, Python

About the role

Key responsibilities & impact

Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
Help mature Raft’s internal ML platform and model lifecycle capabilities, including model packaging, registry/catalog workflows, deployment, monitoring, and operational support
Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters
Support model serving and inference infrastructure for a range of ML use cases, including traditional ML, computer vision, speech/audio, and LLM-based systems
Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
Partner closely with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
Improve observability, reliability, security, and maintainability across ML infrastructure and services
Help evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted deployment environments
Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
Get hands-on with customers at the most forward-leaning places in the Department of War

Requirements

What you’ll need

7+ years of relevant hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related technical roles
5+ years of experience with Docker and Kubernetes in production environments
5+ years of experience supporting enterprise cloud infrastructure or applications in AWS, Azure, or similar environments
Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
Experience building and maintaining machine learning platforms, infrastructure, or pipelines used by engineering or data science teams
Practical experience deploying machine learning workloads on Kubernetes
Experience managing clusters or workloads that use GPUs
Strong understanding of Helm and Kubernetes deployment patterns
Strong scripting or programming skills, preferably in Python
Experience with modern software engineering practices including Git, CI/CD, DevOps, and Agile/Scrum workflows
Strong troubleshooting, systems thinking, and communication skills
Ability to work independently and collaboratively in a fast-moving environment
Ability to obtain and maintain a Top Secret clearance
Ability to obtain Security+ certification within the first 90 days of employment

Benefits

Comp & perks

Highly competitive salary
Fully covered health, dental, and vision insurance
401(k) and company match
Take-as-you-need PTO + 11 paid holidays
Education & training benefits
Annual budget for your tech/gadget needs
Monthly box of yummy snacks to eat while doing meaningful work
Remote, hybrid, and flexible work options
Team off-site in fun places!
Generous Referral Bonuses
And More!

ATS Keywords

✓ Tailor your resume

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

MLOps, Kubernetes, Docker, AWS, Azure, Python, CI/CD, DevOps, machine learning, Helm

Soft Skills

troubleshooting, systems thinking, communication, independent work, collaboration, adaptability, problem-solving, organizational skills, leadership, customer engagement

Certifications

Top Secret clearance, Security+