
Site Reliability Engineer – AI Infrastructure
Andromeda
Full-time
Location Type: Remote
Location: California • United States
About the role
- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
- Build automation and tooling to streamline cluster deployments and integrations
- Debug customer issues across networking, storage, scheduling, and system layers
- Improve reliability and scalability of both training and inference infrastructure
- Design and implement monitoring, alerting, and observability for critical systems
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services
- Participate in on-call and incident response, leading postmortems and reliability improvements
Requirements
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Strong Linux systems and networking fundamentals
- Deep experience with Kubernetes and container orchestration at scale
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
- Strong automation and scripting skills (Python, Go, or Bash)
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
- Track record of operating production systems and leading incident response
Benefits
- Ownership and autonomy to shape systems
- Opportunities to work directly with customers and providers
Applicant Tracking System Keywords
Hard Skills & Tools
Kubernetes, Linux, Networking, Infrastructure-as-Code, Terraform, Helm, Ansible, Python, Go, Bash
Soft Skills
collaboration, incident response, reliability improvement, debugging, problem-solving