Tech Stack
Tools & technologies: AWS, Docker, EC2, Go, Kubernetes, .NET, Oracle, Postgres, Python, Terraform
About the role
Key responsibilities & impact
- **Incident Leadership:** Act as Incident Response Lead in War Rooms, coordinating technical remediation and communication with stakeholders.
- **Observability Engineering:** Design and evolve telemetry in Datadog (Logs, APM, Traces and business metrics) to reduce MTTD and the team's cognitive load.
- **Workload Management on AWS Amplify:** Ensure the resilience and scalability of hosted front-end applications and critical APIs.
- **SRE Governance:** Define and monitor SLIs, SLOs and SLAs, managing the Error Budget to balance delivery speed with stability.
- **Mitigation Automation:** Develop auto-healing tools and scripts (automatic rollback, controlled restart, component isolation).
- **Root Cause Analysis:** Lead blameless post-mortem processes and ensure the implementation of structural improvements to prevent recurrence.
- **Systems Modernization:** Work with development teams to implement resilience patterns (Circuit Breakers, Bulkheads and Rate Limiting) in both modern architectures and legacy systems.
- **AI in Operations:** Implement anomaly detection and intelligent response solutions using AIOps (Datadog Bits AI or AWS DevOps Agent).
Requirements
What you’ll need
- **Proven Seniority in SRE or DevOps:** Solid experience in high-scale, mission-critical environments.
- **Deep AWS Expertise:** Advanced experience with EC2, RDS, S3, IAM, EKS and Amplify.
- **Observability Tools:** Strong experience in monitoring, logging and APM (preferably using Datadog).
- **Containers & Orchestration:** Strong knowledge of Docker and Kubernetes (EKS/GKE).
- **Infrastructure as Code (IaC):** Proficiency in Terraform.
- **Development/Scripting:** Proficient in Python, Go or Shell scripting for automation.
- **Incident Management:** Real experience with on-call rotations and real-time problem resolution.
- **Plus / Nice-to-haves:**
  - **Analytical Profile for Legacy Systems:** Experience troubleshooting .NET Framework applications and Oracle or PostgreSQL databases.
  - **Chaos Engineering:** Experience executing controlled stress and resilience tests.
  - **Certifications:** AWS Certified DevOps Engineer - Professional or official Datadog certifications.
Benefits
Comp & perks
- 📚 Educational Incentives (Partnerships with Educational Institutions)
- 🌴 Paid Vacation
- 🏋️ TotalPass
- 🎂 Birthday off
- 🏥 Health Insurance
- 🦷 Dental Insurance
- 🤰 Maternity Leave
- 👨‍👩‍👧‍👦 Paternity Leave
- 🌟 Reimbursement for AWS Certifications
