Tech Stack
Tools & technologies: AWS, Azure, Cloud, Distributed Systems, Docker, Elasticsearch, Google Cloud Platform, Grafana, Kubernetes, Linux, Python
About the role
Key responsibilities & impact
- Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
- Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
- Reduce operational toil by identifying opportunities for automation and process improvement
- Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
- Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
- Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
- Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
- Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
- Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
- Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
- Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
- Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
- Support other tasks or projects as assigned to meet team and business needs
Requirements
What you’ll need
- 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
- Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
- Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
- Solid understanding of Linux, networking, and distributed systems fundamentals
- Experience working with containerized environments such as Docker and Kubernetes
- Strong scripting and automation skills using Python and/or Bash
- Experience participating in on-call rotations and incident response in production environments
- Strong written and spoken English
- Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
- Exposure to hyperscale or service-provider-grade platforms is an advantage
- Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
- Experience working with hybrid or on-premises integrations is beneficial
- Familiarity with chaos engineering and resilience testing will be considered an asset
Benefits
Comp & perks
- A competitive salary that values you and your unique skill sets
- Career advancement & professional development opportunities to help you reach your full potential
- Flexible work arrangements to support work/life balance
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
SLIs, SLOs, error budgets, Kubernetes, Docker, Python, Bash, Linux, networking, distributed systems
Soft Skills
incident coordination, communication, ownership, process improvement, leadership, collaboration, documentation, blameless postmortems, scripting, automation
