Baseten

Senior Manager, Cloud Platform – Site Reliability

Senior Manager leading Cloud Platform and Site Reliability Engineering at Baseten, enhancing AI model production and infrastructure strategies.

Posted 5/18/2026 • Full-time • San Francisco, California, United States • Senior • $165,000 - $330,000 per year

Tech Stack

Tools & technologies
Cloud, Distributed Systems, Flux, Grafana, Jenkins, Kubernetes, Prometheus, Terraform

About the role

Key responsibilities & impact
  • Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement.
  • Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments.
  • Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews.
  • Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements.
  • Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through.
  • Translate recurring operational pain points and customer feedback into roadmap priorities, product improvements, and runbook enhancements across both teams.
  • Ensure best practices for CI/CD, infrastructure-as-code, GitOps, Kubernetes, and cloud resource management are consistently adopted and maintained across the org.
  • Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure requirements.
  • Navigate ambiguity and make sound architectural and organizational tradeoffs, avoiding unnecessary complexity while enabling your teams to move fast.
  • Demonstrate accountability, pride of ownership, and high standards — and expect the same from your leads and their teams.

Requirements

What you’ll need
  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
  • Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment.
  • Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions.
  • Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm).
  • Strong foundation in observability tooling — metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), tracing (OpenTelemetry) — and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code.
  • Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through.
  • Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery.
  • Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences.
  • No prior machine learning experience is required, but you should be open to learning about ML infrastructure and model serving.

Benefits

Comp & perks
  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
  • Paid parental leave
  • Fertility and family-building stipend through Carrot
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Kubernetes, cloud infrastructure, distributed systems, infrastructure-as-code, Terraform, Pulumi, CI/CD, GitHub Actions, GitLab CI, Jenkins
Soft Skills
leadership, communication, accountability, collaboration, problem-solving, strategic thinking, adaptability, technical excellence, customer focus, organizational skills