
Staff ML Ops Engineer
Albert Invent
Full-time
Location Type: Remote
Location: California • United States
About the role
- Design, deploy, and maintain Kubernetes infrastructure supporting AI/ML workloads
- Manage containerized services, autoscaling, networking, and resource optimization
- Design and build high-performance Python APIs and services using FastAPI or similar frameworks
- Architect backend systems for scalability, reliability, and low latency
- Build integrations between AI/ML systems and the broader Albert platform
- Build and operate distributed systems that handle compute-intensive and high-throughput workloads
- Design for fault tolerance, graceful degradation, and horizontal scalability
- Implement async workflows, job queues, and task orchestration as needed
- Architect and maintain data pipelines and storage systems supporting AI/ML workflows
- Implement observability including logging, metrics, tracing, and alerting
- Own system reliability—troubleshoot issues, conduct post-mortems, and continuously improve
- Design CI/CD pipelines and promote automation best practices
- Partner closely with ML engineers to understand requirements and deliver production-ready infrastructure
- Translate ML prototypes and research code into scalable, maintainable systems
Requirements
- A Bachelor's degree in Computer Science or a related field with 7+ years of industry experience in software engineering, or a Master's or PhD with 5+ years
- Experience supporting AI/ML teams or deploying ML systems in production
- Experience with GPU workloads and scheduling
- Advanced proficiency in Python including async programming and performance optimization
- Deep experience with Kubernetes—cluster management, networking, autoscaling, and troubleshooting
- Strong background in distributed systems and microservices architecture
- Experience with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code
- Proficiency in REST API development using FastAPI, Flask, or similar
- Experience with containerization and CI/CD pipelines
- Track record of operating production systems at scale
Benefits
- Health insurance
- Flexible working hours
- Professional development opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Python, FastAPI, async programming, CI/CD, containerization, distributed systems, microservices architecture, data pipelines, performance optimization
Soft Skills
troubleshooting, system reliability, collaboration, communication, problem-solving, continuous improvement, scalability design, fault tolerance, task orchestration, resource optimization
Certifications
Bachelor's degree in Computer Science, Master's degree in Computer Science, PhD in Computer Science