Site Reliability Engineer II

Backblaze

Site Reliability Engineer II at Backblaze, focusing on service stability and reliability through automation and incident response, and collaborating with teams to improve operational efficiency.

Posted 5/28/2026 • Full-time • Remote • 🇺🇸 United States • Junior, Mid-Level

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery

Soft Skills

collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices

Tools & Technologies

Prometheus, Grafana, Catchpoint, ELK, Terraform, Ansible, Jenkins, Kubernetes, Docker, CI/CD pipelines

Industry Keywords

ITIL, OSS, service level indicators (SLIs), service level objectives (SLOs), error budgets, production environments, large-scale systems, microservices, operational best practices, on-call rotations

Tech Stack

Tools & technologies

Ansible, Docker, Go, Grafana, Jenkins, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Support the availability and durability of critical services across production environments.
Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
Develop automation for common operational tasks, reducing manual intervention and toil.
Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
Partner with engineering, product, and operations teams to support resilient system design and operations.
Assist in capacity planning and disaster recovery exercises.
Work with vendors and service providers to troubleshoot service issues and track SLA performance.
Document systems, share learnings, and help grow a reliability-minded engineering culture.
Contribute to playbooks, runbooks, and operational documentation.
Identify recurring issues and propose long-term improvements.
Promote reliability-focused practices within development and operations teams.

Requirements

What you’ll need

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
2–4 years of experience in site reliability, systems engineering, or operations.
Exposure to large-scale, production-grade systems.
Solid Linux systems administration and troubleshooting skills.
Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
Proficiency in at least one scripting language (Python, Bash, or Go).
Understanding of containers (Kubernetes, Docker) and microservices concepts.
Knowledge of incident response and operational best practices.

Benefits

Comp & perks

Flexible working hours
Professional development opportunities
Remote work options