
Senior DevOps Engineer – ML Infrastructure
Serve Robotics
full-time
Location Type: Remote
Location: Remote • 🇨🇦 Canada
Salary
💰 CA$155,000 - CA$195,000 per year
Job Level
Senior
Tech Stack
AWS, Azure, Cloud, Docker, Google Cloud Platform, Jenkins, Kubernetes, Python, SQL, Terraform
About the role
- Deploy and maintain our ML training orchestration system that operates across multiple platforms.
- Manage cloud and on-premise environments for large-scale distributed data processing and ML training/inference systems.
- Automate deployment pipelines, monitoring, and alerting for ML and data services.
- Collaborate closely with data scientists, ML engineers, and autonomy teams to streamline experimentation and model deployment.
- Maintain and improve CI/CD systems to support rapid development and testing.
- Implement best practices for system security, reliability, and observability.
- Optimize infrastructure costs and ensure efficient resource utilization.
- Improve internal developer productivity through tooling, documentation, and support.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent experience.
- 5+ years of experience as a DevOps, SRE, or Infrastructure Engineer, preferably supporting ML or data-intensive systems.
- Strong experience with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes, Docker).
- Proficiency in infrastructure-as-code tools such as Terraform or Helm.
- Solid understanding of CI/CD systems (GitLab CI, Jenkins, ArgoCD, etc.).
- Experience with Python and SQL.
- Experience with cloud security, IAM (Identity and Access Management), and access control.
- Experience analyzing and optimizing hardware performance.
- Experience with GPU cluster management.
Benefits
- Offers Equity