Senior Manager, Cloud Platform – Site Reliability
Baseten
Senior Manager leading Cloud Platform and Site Reliability Engineering at Baseten, enhancing AI model production and infrastructure strategies.
Posted 5/18/2026 • full-time • San Francisco • California • 🇺🇸 United States • Senior • 💰 $165,000 - $330,000 per year
Tech Stack
Tools & technologies: Cloud, Distributed Systems, Flux, Grafana, Jenkins, Kubernetes, Prometheus, Terraform
About the role
Key responsibilities & impact
- Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering orgs, building a culture of ownership, technical excellence, and continuous improvement.
- Set the technical direction and roadmap for infrastructure, reliability, and platform engineering at the org level — balancing near-term operational needs with long-term strategic investments.
- Own the reliability posture of the platform end-to-end, establishing and enforcing org-wide standards for SLOs/SLIs, incident response, observability-as-code, runbooks, and post-incident reviews.
- Drive cross-functional collaboration with product, engineering, and customer-facing teams to ensure infrastructure capabilities and reliability investments align with product goals and enterprise customer requirements.
- Oversee incident management and escalation processes for high-severity production issues, ensuring clear communication, rapid resolution, and systemic follow-through.
- Translate recurring operational pain points and customer feedback into roadmap priorities, product improvements, and runbook enhancements across both teams.
- Ensure best practices for CI/CD, infrastructure-as-code, GitOps, Kubernetes, and cloud resource management are consistently adopted and maintained across the org.
- Partner with forward-deployed and customer success teams to support enterprise accounts with strict SLAs and complex infrastructure requirements.
- Navigate ambiguity and make sound architectural and organizational tradeoffs, avoiding unnecessary complexity while enabling your teams to move fast.
- Demonstrate accountability, pride of ownership, and high standards — and expect the same from your leads and their teams.
Requirements
What you’ll need
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
- Proven experience managing managers and leading multiple high-performing infrastructure, platform, or SRE teams in a fast-paced, high-growth environment.
- Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE, or similar), cloud infrastructure, and distributed systems, with the ability to engage credibly in architectural and operational decisions.
- Hands-on background with infrastructure-as-code (e.g., Terraform, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, Jenkins); familiarity with GitOps workflows (e.g., Flux CD, ArgoCD, Helm).
- Strong foundation in observability tooling — metrics (Prometheus, VictoriaMetrics), logging (Loki, ELK), dashboards (Grafana), tracing (OpenTelemetry) — and a track record of raising reliability standards through SLOs, SLIs, and observability-as-code.
- Experience owning incident management and enterprise SLAs at scale, including executive-level communication during high-severity incidents and rigorous post-incident follow-through.
- Demonstrated ability to lead complex, multi-stakeholder technical initiatives from scoping through execution, balancing engineering excellence with pragmatic delivery.
- Strong communication skills with executive presence, capable of representing technical work clearly to both technical and non-technical audiences.
- No prior machine learning experience required, but should be open to learning about ML infrastructure and model serving.
Benefits
Comp & perks
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Flexible PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
- Paid parental leave
- Fertility and family-building stipend through Carrot
- Company-facilitated 401(k)
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, cloud infrastructure, distributed systems, infrastructure-as-code, Terraform, Pulumi, CI/CD, GitHub Actions, GitLab CI, Jenkins
Soft Skills
leadership, communication, accountability, collaboration, problem-solving, strategic thinking, adaptability, technical excellence, customer focus, organizational skills