Site Reliability Engineer II

Backblaze

Site Reliability Engineer II at Backblaze, focusing on service stability and reliability through automation and incident response, and collaborating with teams to enhance operational efficiency.

Posted 5/28/2026 • Full-time • Remote • 🇺🇸 United States • Junior, Mid-Level

Tech Stack

Tools & technologies
Ansible, Docker, Go, Grafana, Jenkins, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Support the availability and durability of critical services across production environments.
  • Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
  • Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
  • Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
  • Develop automation for common operational tasks, reducing manual intervention and toil.
  • Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
  • Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
  • Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
  • Partner with engineering, product, and operations teams to support resilient system design and operations.
  • Assist in capacity planning and disaster recovery exercises.
  • Work with vendors and service providers to troubleshoot service issues and track SLA performance.
  • Document systems, share learnings, and help grow a reliability-minded engineering culture.
  • Contribute to playbooks, runbooks, and operational documentation.
  • Identify recurring issues and propose long-term improvements.
  • Promote reliability-focused practices within development and operations teams.

Requirements

What you’ll need
  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 2–4 years of experience in site reliability, systems engineering, or operations.
  • Exposure to large-scale, production-grade systems.
  • Solid Linux systems administration and troubleshooting skills.
  • Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
  • Proficiency in at least one scripting language (Python, Bash, or Go).
  • Understanding of containers (Kubernetes, Docker) and microservices concepts.
  • Knowledge of incident response and operational best practices.

Benefits

Comp & perks
  • Flexible working hours
  • Professional development opportunities
  • Remote work options

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery
Soft Skills
collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices