Backblaze

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: United States

Salary

💰 $150,000 - $200,000 per year

About the role

  • Own and drive the availability, durability, and performance of critical services across all production environments.
  • Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
  • Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
  • Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
  • Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
  • Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
  • Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
  • Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
  • Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
  • Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
  • Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
  • Lead capacity planning and disaster recovery strategy across critical infrastructure components.
  • Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
  • Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
  • Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
  • Proactively identify systemic, recurring issues, then architect and drive long-term improvements and strategic design action plans.
  • Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Experience leading security-minded operations, with a focus on system-wide patching, hardening, and proactive vulnerability identification.
  • Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
  • Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
  • Expert knowledge of incident response methodologies and operational best practices.
  • Proven experience designing and operating container orchestration (Kubernetes, Docker) and a solid grasp of microservices concepts are required.
  • Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Benefits

  • Healthcare for family, including dental and vision
  • Competitive compensation and 401(k)
  • RSU grants for full-time employees
  • ESPP
  • Flexible vacation policy
  • Maternity & paternity leave
  • MacBook Pro to use for work, plus a generous stipend to personalize your workstation
  • Childcare bonus (human children only)
  • Fertility treatment and support
  • Learning & development program
  • Commuter benefits
  • Culture that supports a healthy work-life balance

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux systems administration, scripting, programming, site reliability engineering, distributed systems, incident response, monitoring, alerting, container orchestration, microservices
Soft Skills
leadership, mentoring, problem-solving, communication, strategic thinking, collaboration, documentation, proactive identification, project management, cross-functional teamwork