Netflix

Site Reliability Engineer 5, Core

full-time

Location Type: Remote

Location: United States

About the role

  • Design, implement, and maintain scalable and reliable infrastructure to support the Netflix Streaming Suite.
  • Collaborate with engineering and product teams to integrate observability, reliability, and security considerations into the entire software development lifecycle.
  • Develop and implement automation tools for monitoring, deployment, and incident response to ensure efficient and reliable operations.
  • Participate in on-call rotations to ensure the 24/7 health of the Netflix Streaming Suite, and contribute to incident response, diagnosis, and resolution.
  • Implement and maintain a robust incident response framework, including blame-aware incident reviews to learn from operational surprises.
  • Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective.
  • Champion and embed a culture of reliability across the Ads organization.
  • Act as a force multiplier, scaling your technical expertise by creating clear documentation, developing best-practice guides, and building tooling to roll out reliability enhancements automatically.

Requirements

  • 5+ years of experience as a Site Reliability Engineer (SRE), Production Engineer, or similar role supporting business-critical, high-traffic services.
  • Write code to solve problems. You are proficient in one or more languages like Python, Go, or Java and believe in automating solutions over manual effort.
  • Are fluent in modern cloud infrastructure. You have hands-on experience with cloud providers such as AWS/Azure/GCP, Infrastructure as Code such as Terraform, and container orchestration systems like Kubernetes.
  • Understand large-scale distributed systems, their common failure modes and edge cases.
  • Thrive on collaboration and influence. You have excellent communication skills and a proven ability to build relationships with and educate engineering partners.
  • Have experience with incident management and response, and are a natural troubleshooter.
  • You can calmly navigate complex production issues, identify root causes, and implement effective, lasting solutions.
  • Possess a growth mindset. You are relentlessly curious, committed to continuous improvement, and passionate about scaling your expertise.
  • Excellent communication & collaboration skills and a continuous improvement mindset.
  • Proven ability to cultivate relationships through influence.

Benefits

  • Inclusion is a Netflix value and we strive to host a meaningful interview experience for all candidates.
  • We are an equal-opportunity employer and celebrate diversity, recognizing that diversity builds stronger teams.
  • We approach diversity and inclusion seriously and thoughtfully.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineer, Production Engineer, Python, Go, Java, Infrastructure as Code, Terraform, Kubernetes, distributed systems, incident management
Soft Skills
communication skills, collaboration, troubleshooting, influence, relationship building, continuous improvement, growth mindset, curiosity, problem-solving, documentation