You will lead through influence, applying your unique combination of cloud infrastructure expertise and software engineering skills to solve complex reliability, scalability and operability challenges across our products and services.
You will enable teams by designing reusable cloud patterns, automation and guardrails to accelerate product delivery.
Monitor and improve service availability, performance and operability, ensuring systems meet defined SLAs and SLOs.
Define and evolve reliability metrics (SLIs and SLOs) and ensure they are observable and actionable.
Serve as an escalation point for critical incidents.
Lead incident response and blameless postmortems.
Drive proactive reliability improvements through problem identification, root cause analysis, remediation and systemic fixes.
Partner with development teams to design, deploy and manage cloud infrastructure using modern best practices.
Establish and evolve cloud standards, guardrails, and best practices that balance autonomy, security, reliability and cost efficiency.
Collaborate with product, platform, QA, and security teams throughout the software lifecycle to deliver reliable, scalable, and compliant systems.
Advocate for changes that improve system reliability and engineering velocity.

Requirements

5+ years of experience in software engineering and SRE/DevOps roles supporting production systems.
Strong hands-on experience with C#, Python, Java, or Go, particularly for tooling and automation.
Deep experience in at least one major cloud environment, with Azure strongly preferred.
Hands-on experience with Kubernetes and containerized workloads (AKS, Helm, Kustomize).
Experience with Infrastructure as Code using Bicep or Terraform in complex cloud environments.
Strong background in observability platforms and monitoring strategies (Datadog, Prometheus, etc.).
Working knowledge of GitOps and experience building and maintaining CI/CD pipelines (Jenkins, Azure DevOps).
Systems-oriented mindset with a focus on availability, resilience and enabling teams to operate effectively in the cloud.
Proven ability to collaborate across teams to diagnose issues, identify root causes, and drive resolutions.
Experience leading or participating in incident response and post-incident reviews.
Solid understanding of change and release management practices.
Ability to learn quickly, adapt to new technologies, and operate with minimal supervision.
Strong written and verbal communication skills, with the ability to influence senior stakeholders and guide engineering teams.
Curiosity, adaptability and a demonstrated habit of leaving systems better than you found them.

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

C#, Python, Java, Go, Kubernetes, Bicep, Terraform, Datadog, Prometheus, CI/CD

Soft Skills

leadership, collaboration, problem identification, root cause analysis, communication, adaptability, curiosity, influence, team enablement, system improvement