Salary
💰 $134,000 - $254,000 per year
Tech Stack
Android, AWS, Azure, Cloud, Docker, Go, Java, Jenkins, JMeter, Kubernetes, Microservices, Python, ServiceNow, Splunk, Terraform, TypeScript, WordPress
About the role
- Lead the end-to-end delivery of reliability solutions that meet customer needs while aligning with technology guardrails and strategic roadmaps.
- Define and implement SLOs, SLIs, and error-budget policies; integrate them with CI/CD pipelines and automated quality gates.
- Design and build cloud-native reliability tooling—auto-scaling, self-healing, blue/green and canary release frameworks—leveraging AWS services (EKS, Lambda, Fargate, Auto Scaling, Route 53, CloudWatch).
- Implement and extend observability platforms (metrics, logs, traces, events) using Datadog, Splunk, and AWS-native services.
- Drive Gen-AI/ML experimentation for anomaly detection, predictive scaling, and automated incident triage; transition validated prototypes into production platforms.
- Champion infrastructure-as-code (Terraform, CloudFormation, CDK) and GitOps workflows to ensure repeatable, auditable changes.
- Embed chaos engineering and resilience testing (Gremlin, Litmus, ChaosMesh, Fault Injection Simulator) into release pipelines.
- Optimize incident management processes: blameless post-mortems, rapid root-cause analysis, actionable runbooks, and continuous learning loops.
- Collaborate with Quality Engineering, Security, Architecture, and Delivery teams to create an end-to-end DevTestOps ecosystem.
- Mentor and coach engineers, fostering a culture of reliability, automation, and customer-centric thinking.
- Stay current on emerging technologies—container orchestration, service mesh, serverless, edge computing, Gen-AI for ops—and apply relevant innovations to ongoing work.
- Document architectures, reliability standards, and operational playbooks for maintainability and knowledge transfer.
Requirements
- Bachelor’s or master’s degree in Computer Science, Engineering, or a related discipline (or equivalent experience).
- 10+ years of hands-on engineering experience, with at least 5 years focused on SRE, DevOps, or large-scale cloud operations.
- Deep knowledge of containerization (Docker, Kubernetes/EKS), service mesh (Istio, Linkerd), and microservice architectures.
- Practical experience with observability stacks (Datadog, Splunk).
- Proficiency in at least one programming language (Python, Go, Java, TypeScript, or similar).
- Familiarity with CI/CD systems (GitHub Actions, Azure DevOps, Jenkins) and release strategies (blue/green, canary, feature flags).
- Hands-on exposure to chaos-engineering and resilience testing tools (Gremlin, ChaosMesh) and load/performance tools (k6, JMeter, LoadRunner).
- Experience with incident management platforms (ServiceNow) and running blameless post-mortems.
- Strong communication, facilitation, consensus-building, and stakeholder-management skills.
- Relevant certifications (AWS DevOps, Kubernetes, observability platforms) are a plus.