
Senior Site Reliability Engineer
Spotify
full-time
Location Type: Remote
Location: New York • United States
Salary
💰 $164,448 - $234,926 per year
About the role
- Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal’s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.
- Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads — including sandboxing, resource isolation, cost governance, and security boundaries.
- Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.
- Lead full-stack reliability. Operate across a modern web stack (TypeScript, React, Python). Although the role is not frontend-heavy, you’ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.
- Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.
- Shape the roadmap. Partner with engineering and product leadership to evolve our infrastructure in step with generative AI features. Translate operational insights into strategic input on the product roadmap.
Requirements
- You have 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS), using Terraform and Kubernetes to run production systems at scale.
- You have practical experience (or a strong demonstrated interest) in operating LLM-based systems, RAG pipelines, or agentic workloads, and you understand the reliability challenges of non-deterministic systems.
- You think in distributed-systems first principles (consistency, availability, partition tolerance) and translate that thinking into pragmatic infrastructure decisions.
- You are proficient in at least one modern language (TypeScript, Java, Go, or Python) and comfortable navigating large, heterogeneous codebases, including environments where AI-generated PRs are common.
- You build automation and improve systems so that whole categories of operational issues disappear over time.
- You communicate complex infrastructure trade-offs clearly to both technical and non-technical stakeholders, and you write postmortems that lead to meaningful change.
Benefits
- health insurance
- six-month paid parental leave
- 401(k) retirement plan
- monthly meal allowance
- 23 paid days off
- paid flexible holidays
- paid sick leave
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
cloud infrastructure, Terraform, Kubernetes, TypeScript, React, Python, LLM-based systems, RAG pipelines, distributed systems, automation
Soft Skills
mentoring, communication, operational thinking, incident management, postmortem analysis, strategic input, collaboration