Site Reliability Engineer

Baseten

As a Site Reliability Engineer at Baseten, you will ensure the reliability of multi-cloud Kubernetes infrastructure and collaborate with engineering teams to improve operational efficiency and system resilience.

Posted 5/12/2026 | full-time | San Francisco • California • 🇺🇸 United States | Mid-Level, Senior | 💰 $135,000 - $285,000 per year

Tech Stack

Tools & technologies

Cloud, Flux, Grafana, Kubernetes, Prometheus, Terraform

About the role

Key responsibilities & impact

Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations.
Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Define and instrument SLOs and SLIs across customer workloads and internal services.
Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define.

Requirements

What you’ll need

Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
Experience in building and maintaining scalable infrastructure.
Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
Comfort working at the intersection of engineering and operations — you write code, but you also think deeply about process, escalation paths, and operational leverage.
Familiarity with incident management platforms (incident.io or similar) is a plus.
No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well.

Benefits

Comp & perks

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Fertility and family-building stipend through Carrot.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, observability tooling, metrics, logging, dashboards, alerting, infrastructure-as-code, Terraform, Helm, GitOps

Soft Skills

incident response, post-mortem analysis, process thinking, escalation paths, operational leverage, navigating ambiguity, making principled tradeoffs, avoiding complexity