
Machine Learning Operations Engineer
K1X
Employment Type: Full-time
Location Type: Remote
Location: Illinois • United States
About the role
- Design and build scalable ML infrastructure to support model training, evaluation, and deployment.
- Develop and maintain containerized environments using Docker and Kubernetes.
- Build and manage distributed training pipelines and orchestration workflows.
- Implement and maintain ML lifecycle tooling such as MLflow for experiment tracking and reproducibility.
- Own production inference systems, including deployments on NVIDIA Triton Inference Server.
- Design and operate low-latency, high-availability model serving architectures.
- Implement CI/CD pipelines for ML deployment, versioning, and rollback strategies.
- Build and maintain data pipelines integrated with Snowflake and related data systems.
- Implement monitoring, logging, and alerting for model performance, drift detection, and system health.
- Partner with ML Engineers to improve developer experience and accelerate delivery.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent experience.
- 5+ years of experience in software engineering, DevOps, or MLOps roles.
- Strong proficiency in Python and experience building production-grade systems.
- Hands-on experience with Docker, Kubernetes, and distributed systems.
- Experience building and maintaining CI/CD pipelines.
- Familiarity with ML lifecycle tools such as MLflow or similar.
- Experience working with cloud-based data platforms such as Snowflake.
- Strong understanding of system design, APIs, and microservices architectures.
- Proven debugging and troubleshooting ability across distributed systems.
- Experience managing inference infrastructure such as NVIDIA Triton Inference Server.
- Experience building large-scale training infrastructure, including GPU workloads and distributed training.
- Familiarity with feature stores, data versioning, and experiment tracking systems.
- Experience supporting NLP or document processing pipelines.
- Exposure to observability tools such as Prometheus, Grafana, or similar.
- Experience working in SaaS environments with high availability, productivity, and performance requirements.
- A strong bias toward automation, scalability, and continuous improvement.
- A collaborative mindset and ability to work cross-functionally with engineering and data teams.
Benefits
- Unlimited Vacation Policy + Sick Time + Holidays
- Fully Remote Opportunity
- Healthcare Benefits and 401K
- Paid Parental Leave
- Growing Startup Culture