Site Reliability Engineer
Baseten
Site Reliability Engineer at Baseten, ensuring the reliability of multi-cloud Kubernetes infrastructure and collaborating with engineering teams to enhance operational efficiency and system resilience.
Posted 5/12/2026 • full-time • San Francisco • California • 🇺🇸 United States • Mid-Level, Senior • 💰 $135,000 - $285,000 per year
Tech Stack
Tools & technologies: Cloud, Flux, Grafana, Kubernetes, Prometheus, Terraform
About the role
Key responsibilities & impact
- Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
- Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
- Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
- Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations.
- Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
- Define and instrument SLOs and SLIs across customer workloads and internal services.
- Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define.
Requirements
What you’ll need
- Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
- Experience in building and maintaining scalable infrastructure.
- Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
- Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
- Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
- Comfort working at the intersection of engineering and operations — you write code, but you also think deeply about process, escalation paths, and operational leverage.
- Familiarity with incident management platforms (incident.io or similar) is a plus.
- No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well.
Benefits
Comp & perks
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents
- Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
- Paid parental leave
- Fertility and family-building stipend through Carrot
- Company-facilitated 401(k)
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, observability tooling, metrics, logging, dashboards, alerting, infrastructure-as-code, Terraform, Helm, GitOps
Soft Skills
incident response, post-mortem analysis, process thinking, escalation paths, operational leverage, navigating ambiguity, making principled tradeoffs, avoiding complexity