Site Reliability Engineer II
Backblaze
The Site Reliability Engineer II at Backblaze ensures the stability, scalability, and reliability of services by building automation, maintaining observability, and supporting incident response for customer-facing systems.
Tech Stack
Tools & technologies: Ansible, Docker, Go, Grafana, Jenkins, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- Support the availability and durability of critical services across production environments.
- Monitor service health using SLIs, SLOs, and error budgets, and escalate issues when thresholds are at risk.
- Participate in on-call rotations, incident response, and post-incident reviews to drive service improvements.
- Follow established ITIL/OSS processes (incident, change, problem, and capacity management).
- Develop automation for common operational tasks, reducing manual intervention and toil.
- Contribute to monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK).
- Work with CI/CD pipelines, configuration management, and infrastructure as code tools (Terraform, Ansible, Jenkins).
- Write scripts (Bash, Python, Go, etc.) to improve system reliability and efficiency.
- Partner with engineering, product, and operations teams to support resilient system design and operations.
- Assist in capacity planning and disaster recovery exercises.
- Work with vendors and service providers to troubleshoot service issues and track SLA performance.
- Document systems, share learnings, and help grow a reliability-minded engineering culture.
- Contribute to playbooks, runbooks, and operational documentation.
- Identify recurring issues and propose long-term improvements.
- Promote reliability-focused practices within development and operations teams.
Requirements
What you’ll need
- Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- 2–4 years of experience in site reliability, systems engineering, or operations.
- Exposure to large-scale, production-grade systems.
- Solid Linux systems administration and troubleshooting skills.
- Familiarity with service reliability concepts: monitoring, alerting, incident response, and root cause analysis.
- Proficiency in at least one scripting language (Python, Bash, or Go).
- Understanding of containers (Kubernetes, Docker) and microservices concepts.
- Knowledge of incident response and operational best practices.
Benefits
Comp & perks
- Paid time off
- Professional development opportunities
- Remote work options
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux administration, scripting (Python, Bash, Go), monitoring, alerting, incident response, root cause analysis, infrastructure as code, automation, capacity planning, disaster recovery
Soft Skills
collaboration, problem-solving, communication, documentation, service improvement, reliability-focused practices