Tech Stack
Tools & technologies
AWS, Azure, Go, Google Cloud Platform, Grafana, Linux, Prometheus, Python
About the role
Key responsibilities & impact
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
- Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
- Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
- Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
- Collaborate with Engineering and partner teams to conduct operability reviews
Requirements
What you’ll need
- 8+ years of experience managing reliability, scalability, and availability for large-scale production services
- Deep expertise in programming (e.g., Python, Go, or C/C++)
- Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
- Experience in high-stakes incident management and participation in a 24/7 on-call rotation
- Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews
Benefits
Comp & perks
- Various health plans
- Time off plans for vacation and sick time
- Parental leave options
- Retirement options
- Education reimbursement
- In-office perks, and more!
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, Go, C/C++, AWS, Azure, GCP, Prometheus, Grafana, OpenTelemetry, Linux
Soft Skills
leadership, incident management, communication, collaboration, problem management
