Site Reliability Engineer – AI Infrastructure

Andromeda

full-time

Posted on: 2/27/2026

Location Type: Remote

About the role

Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers
Build automation and tooling to streamline cluster deployments and integrations
Debug customer issues across networking, storage, scheduling, and system layers
Improve reliability and scalability of both training and inference infrastructure
Design and implement monitoring, alerting, and observability for critical systems
Collaborate with engineering and product teams to plan and deliver infrastructure for new services
Participate in on-call rotations and incident response, and lead postmortems and reliability improvements

Benefits

Applicant Tracking System Keywords

Hard Skills & Tools

Kubernetes, Linux, Networking, Infrastructure-as-Code, Terraform, Helm, Ansible, Python, Go, Bash

Soft Skills

collaboration, incident response, reliability improvement, debugging, problem-solving