
Senior Site Reliability Engineer
Backblaze
full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 - $200,000 per year
About the role
- Own and drive the availability, durability, and performance of critical services across all production environments.
- Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
- Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
- Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
- Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
- Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
- Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
- Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
- Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
- Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
- Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
- Lead capacity planning and disaster recovery strategy across critical infrastructure components.
- Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
- Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
- Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
- Proactively identify systemic, recurring issues, then architect and drive long-term improvements and strategic action plans to resolve them.
- Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- 8+ years of progressive experience in site reliability, systems engineering, or operations.
- Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
- Expert-level Linux systems administration and advanced troubleshooting skills.
- Experience leading security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
- Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
- Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
- Expert knowledge of incident response methodologies and operational best practices.
- Proven experience designing and operating container orchestration platforms (Kubernetes, Docker) and microservices architectures.
- Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.
Benefits
- Healthcare for family, including dental and vision
- Competitive compensation and 401(k)
- RSU grants for full-time employees
- ESPP program
- Flexible vacation policy
- Maternity & paternity leave
- MacBook Pro to use for work, plus a generous stipend to personalize your workstation
- Childcare bonus (human children only)
- Fertility treatment and support
- Learning & development program
- Commuter benefits
- Culture that supports a healthy work-life balance
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux systems administration, scripting, programming, site reliability engineering, distributed systems, incident response, monitoring, alerting, container orchestration, microservices
Soft Skills
leadership, mentoring, problem-solving, communication, strategic thinking, collaboration, documentation, proactive identification, project management, cross-functional teamwork