
Site Reliability Engineer II
Backblaze
full-time
Location Type: Remote
Location: India
About the role
- Support the availability and durability of critical services across production environments.
- Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
- Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
- Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
- Develop automation for common operational tasks, reducing manual intervention and toil.
- Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
- Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
- Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
- Partner with engineering, product, and operations teams to support resilient system design and operations.
- Assist in capacity planning and disaster recovery exercises.
- Work with vendors and service providers to troubleshoot service issues and track SLA performance.
- Document systems, share learnings, and help grow a reliability-minded engineering culture.
- Contribute to playbooks, runbooks, and operational documentation.
- Identify recurring issues and propose long-term improvements.
- Promote reliability-focused practices within development and operations teams.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- 2–4 years of experience in site reliability, systems engineering, or operations.
- Exposure to large-scale, production-grade systems.
- Solid Linux systems administration and troubleshooting skills.
- Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
- Proficiency in at least one scripting language (Python, Bash, or Go).
- Understanding of containers (Kubernetes, Docker) and microservices concepts.
- Knowledge of incident response and operational best practices.
Benefits
- Paid time off
- Professional development opportunities
- Remote work options
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery
Soft Skills
collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices