Salary
💰 $160,000 - $200,000 per year
Tech Stack
AWS, Cloud, DynamoDB, Java, JavaScript, Kubernetes, Python, SDLC, Terraform, TypeScript
About the role
- Reliability: Own the company-wide incident lifecycle, including standards for detection, escalation, incident command, customer comms, and high-quality postmortems with action tracking.
- Define and drive SLIs/SLOs for core services; build guardrails and dashboards that make reliability visible and actionable.
- Lead production readiness reviews, capacity/performance planning, load testing, disaster recovery exercises, and resilience engineering (failure testing/chaos where appropriate).
- Level up on-call: right-size rotations, improve paging hygiene, runbooks, and auto-remediation, and continuously improve MTTA/MTTR.
- Security: Embed security into the delivery pipeline through dependency and image scanning, least-privilege/IAM baselines, secrets management, and service-to-service auth.
- Implement SOC 2-aligned controls as code, and build audit-friendly evidence generation into everyday engineering.
- Drive secure-by-default patterns in the platform (network posture, data protection, runtime policies).
- Platform & DevEx: Build and evolve paved roads for deploys, config, and runtime operations in our monorepo (Bazel) and CI/CD (AWS CodePipeline/CodeBuild).
- Partner with product teams to make the secure default the easiest path—templates, tooling, libraries, and automation.
- Improve observability end-to-end (traces, logs, metrics, alerts).
Requirements
- Experienced: Staff-level IC who has led reliability programs at meaningful scale and owned incident response standards.
- Technically Grounded: Deep, hands-on experience with infrastructure at scale, cloud, containerization, and more:
- AWS (multi-service)
- Containerized workloads on ECS and/or Kubernetes
- CI/CD & IaC (Terraform)
- Production Networking/Fundamentals
- Python Proficient: You can read/review service code and land operational improvements.
- Data Driven: You take a data-driven approach to SLOs, capacity, performance, and cost efficiency, backed by strong observability skills.
- Influential: Able to shape direction and create simple, durable standards
- Communicative: Excels in both technical and interpersonal communication, with strong written and verbal skills
- Nice To Have: FinOps, SOC 2, Data Science/ML collaboration, monorepo build tools (Bazel, Buck)