HostPapa

Site Reliability Engineer

Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance.

Posted 4/22/2026 • Full-time • Remote • 🇲🇾 Malaysia • Mid-Level, Senior

Tech Stack

Tools & technologies
AWS, Azure, Cloud, Distributed Systems, Docker, Elasticsearch, Google Cloud Platform, Grafana, Kubernetes, Linux, Python

About the role

Key responsibilities & impact
  • Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
  • Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
  • Reduce operational toil by identifying opportunities for automation and process improvement
  • Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
  • Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
  • Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
  • Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
  • Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
  • Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
  • Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
  • Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
  • Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
  • Support other tasks or projects as assigned to meet team and business needs

Requirements

What you’ll need
  • 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
  • Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
  • Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
  • Solid understanding of Linux, networking, and distributed systems fundamentals
  • Experience working with containerized environments such as Docker and Kubernetes
  • Strong scripting and automation skills using Python and/or Bash
  • Experience participating in on-call rotations and incident response in production environments
  • Strong written and spoken English
  • Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
  • Exposure to hyperscale or service-provider-grade platforms is an advantage
  • Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
  • Experience working with hybrid or on-premises integrations is beneficial
  • Familiarity with chaos engineering and resilience testing will be considered an asset

Benefits

Comp & perks
  • A competitive salary that values you and your unique skill sets
  • Career advancement & professional development opportunities to help you reach your full potential
  • Flexible work arrangements to support work/life balance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SLIs, SLOs, error budgets, Kubernetes, Docker, Python, Bash, Linux, networking, distributed systems
Soft Skills
incident coordination, communication, ownership, process improvement, leadership, collaboration, documentation, blameless postmortems, scripting, automation