Salary
💰 $205,000 - $235,000 per year
Tech Stack
Ansible, AWS, Cloud, Google Cloud Platform, Grafana, Prometheus, Python, Ruby, Terraform
About the role
- Help scale the research compute cluster to meet growing needs.
- Leverage engineering skills to ensure high degrees of uptime, reliability, and robustness.
- Responsible for keeping research clusters available and performant.
- Provide a world-class HPC platform for researchers focusing on machine learning problems at scale.
- Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
- Collaborate with engineering teams to develop monitoring and telemetry improvements.
- Design and oversee operational frameworks to ensure cluster operations meet SLAs.
Requirements
- 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
- Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
- Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
- Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
- Experience with cloud infrastructure (AWS or GCP).
- Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
- Experience with distributed storage technologies (Lustre, Ceph, S3).
- Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
- Bachelor's degree in computer science or equivalent experience.
Benefits
- Medical, dental, and vision coverage
- Life and AD&D insurance
- 20 days of paid time off
- 9 sick days
- 401(k) plan with a company match
- “Friends of Voleon” Candidate Referral Program
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
HPC, batch compute frameworks, machine learning training systems, Python, Ruby, infrastructure-as-code, configuration management, cloud infrastructure, observability stacks, distributed storage technologies
Soft skills
system engineer mindset, automation, collaboration, reliability, robustness, uptime, performance, operational frameworks
Certifications
Bachelor's degree in computer science