Site Reliability Engineer II

Backblaze

The Site Reliability Engineer II at Backblaze ensures the stability, scalability, and reliability of services by building automation, maintaining observability, and supporting incident response for customer-facing systems.

Posted 4/1/2026 • Full-time • Remote • 🇮🇳 India • Junior, Mid-Level

Tech Stack

Tools & technologies

Ansible, Docker, Go, Grafana, Jenkins, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Support the availability and durability of critical services across production environments.
Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
Develop automation for common operational tasks, reducing manual intervention and toil.
Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
Partner with engineering, product, and operations teams to support resilient system design and operations.
Assist in capacity planning and disaster recovery exercises.
Work with vendors and service providers to troubleshoot service issues and track SLA performance.
Document systems, share learnings, and help grow a reliability-minded engineering culture.
Contribute to playbooks, runbooks, and operational documentation.
Identify recurring issues and propose long-term improvements.
Promote reliability-focused practices within development and operations teams.

Requirements

What you’ll need

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
2–4 years of experience in site reliability, systems engineering, or operations.
Exposure to large-scale, production-grade systems.
Solid Linux systems administration and troubleshooting skills.
Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
Proficiency in at least one scripting language (Python, Bash, or Go).
Understanding of containers (Kubernetes, Docker) and microservices concepts.
Knowledge of incident response and operational best practices.

Benefits

Comp & perks

Paid time off
Professional development opportunities
Remote work options

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery

Soft Skills

collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices