Andromeda

Site Reliability Engineer – AI Infrastructure

full-time

Location Type: Remote

Location: California, United States

About the role

  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
  • Build automation and tooling to streamline cluster deployments and integrations
  • Debug customer issues across networking, storage, scheduling, and system layers
  • Improve reliability and scalability of both training and inference infrastructure
  • Design and implement monitoring, alerting, and observability for critical systems
  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services
  • Participate in on-call and incident response, leading postmortems and reliability improvements

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • Strong Linux systems and networking fundamentals
  • Deep experience with Kubernetes and container orchestration at scale
  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
  • Strong automation and scripting skills (Python, Go, or Bash)
  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
  • Track record of operating production systems and leading incident response

Benefits

  • Ownership and autonomy to shape systems
  • Opportunities to work directly with customers and providers

Applicant Tracking System Keywords

Hard Skills & Tools
Kubernetes, Linux, Networking, Infrastructure-as-Code, Terraform, Helm, Ansible, Python, Go, Bash
Soft Skills
collaboration, incident response, reliability improvement, debugging, problem-solving