The Voleon Group

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $205,000 - $235,000 per year

Job Level

Senior

Tech Stack

Ansible, AWS, Cloud, Google Cloud Platform, Grafana, Prometheus, Python, Ruby, Terraform

About the role

  • Help scale the research compute cluster to meet growing needs.
  • Apply engineering skills to ensure high uptime, reliability, and robustness.
  • Keep research clusters available and performant.
  • Provide a world-class HPC platform for researchers focusing on machine learning problems at scale.
  • Support both on-prem and cloud infrastructure, ensuring the best experience for technical staff.
  • Collaborate with engineering teams to develop monitoring and telemetry improvements.
  • Design and oversee operational frameworks to ensure cluster operations meet SLAs.

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead.
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod).
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.).
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible).
  • Experience with cloud infrastructure (AWS or GCP).
  • Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
  • Experience with distributed storage technologies (Lustre, Ceph, S3).
  • Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation.
  • Bachelor's degree in computer science or equivalent experience.

Benefits

  • medical, dental and vision coverage
  • life and AD&D insurance
  • 20 days of paid time off
  • 9 sick days
  • 401(k) plan with a company match
  • “Friends of Voleon” Candidate Referral Program
