Spotify

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: New York, United States

Salary

💰 $164,448 - $234,926 per year

About the role

  • Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal’s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.
  • Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads — including sandboxing, resource isolation, cost governance, and security boundaries.
  • Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.
  • Lead fullstack reliability. Operate across a modern web stack (TypeScript, React, Python). While not frontend-heavy, you’ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.
  • Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.
  • Shape the roadmap. Partner with engineering and product leadership to evolve our infrastructure in step with generative AI features. Translate operational insights into strategic input on the product roadmap.

Requirements

  • 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS), using Terraform and Kubernetes to run production systems at scale.
  • Practical experience with, or a strong demonstrated interest in, operating LLM-based systems, RAG pipelines, or agentic workloads, plus an understanding of the reliability challenges of non-deterministic systems.
  • Ability to reason from distributed-systems first principles (consistency, availability, partition tolerance) and translate that thinking into pragmatic infrastructure decisions.
  • Proficiency in at least one modern language (TypeScript, Java, Go, or Python) and comfort navigating large, heterogeneous codebases, including environments where AI-generated PRs are common.
  • A track record of building automation and improving systems so that whole categories of operational issues disappear over time.
  • Ability to communicate complex infrastructure trade-offs clearly to both technical and non-technical stakeholders, and to write postmortems that lead to meaningful change.

Benefits

  • health insurance
  • six-month paid parental leave
  • 401(k) retirement plan
  • monthly meal allowance
  • 23 paid days off
  • paid flexible holidays
  • paid sick leave

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
cloud infrastructure, Terraform, Kubernetes, TypeScript, React, Python, LLM-based systems, RAG pipelines, distributed systems, automation
Soft Skills
mentoring, communication, operational thinking, incident management, postmortem analysis, strategic input, collaboration