Backblaze

Site Reliability Engineer II

full-time

Location Type: Remote

Location: India

About the role

  • Support the availability and durability of critical services across production environments.
  • Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
  • Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
  • Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
  • Develop automation for common operational tasks, reducing manual intervention and toil.
  • Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
  • Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
  • Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
  • Partner with engineering, product, and operations teams to support resilient system design and operations.
  • Assist in capacity planning and disaster recovery exercises.
  • Work with vendors and service providers to troubleshoot service issues and track SLA performance.
  • Document systems, share learnings, and help grow a reliability-minded engineering culture.
  • Contribute to playbooks, runbooks, and operational documentation.
  • Identify recurring issues and propose long-term improvements.
  • Promote reliability-focused practices within development and operations teams.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 2–4 years of experience in site reliability, systems engineering, or operations.
  • Exposure to large-scale, production-grade systems.
  • Solid Linux systems administration and troubleshooting skills.
  • Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
  • Proficiency in at least one scripting language (Python, Bash, or Go).
  • Understanding of containers (Kubernetes, Docker) and microservices concepts.
  • Knowledge of incident response and operational best practices.

Benefits

  • Paid time off
  • Professional development opportunities
  • Remote work options

Applicant Tracking System Keywords

Hard Skills & Tools

Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery

Soft Skills

collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices