Tech Stack
AWS, Azure, Go, Google Cloud Platform, Grafana, Kubernetes, Postgres, Prometheus, Python, Rust, Terraform
About the role
- Establish SRE patterns, tooling and culture to ensure systems are fast, observable and resilient
- Design and harden production infrastructure (AWS ECS/Fargate via AWS Copilot, migrating to Terraform; RDS/Postgres, S3, EventBridge, Bedrock)
- Lead reliability engineering: define SLOs/SLAs, error-budget policies, capacity planning and load testing
- Own CI/CD: advance the GitHub Actions pipeline, introduce progressive delivery and automated rollbacks
- Instrument & observe: deploy metrics, tracing and logging (Datadog) and run on-call with a focus on MTTR and blameless reviews
- Security & compliance: automate patching, secrets management & rotation, enforce least-privilege IAM and SOC 2 controls
- Coach & collaborate: mentor engineers, work with ML and product squads, influence architecture
- Continuously improve: identify systemic bottlenecks, build tooling to eliminate toil and scale the platform
- Drive migrations and disaster recovery planning (AWS Control Tower migration, RDS to Aurora migration, Terraform adoption)
Requirements
- 6+ years building and operating production systems in AWS, GCP or Azure (AWS preferred)
- Demonstrated ownership of SLOs, incident response and post-incident analysis
- Expert in IaC (Terraform, CDK, Pulumi)
- Experience with container orchestration (ECS, EKS or K8s)
- Proficient with at least one modern language (Python, Rust, Go) and strong bash skills
- Deep familiarity with observability stacks (Datadog, Grafana, Prometheus, OTEL)
- Track record of raising the bar for security, compliance and cost optimisation
- (Nice-to-have) Experience with infrastructure for ML workflows (model training, feature stores)
- (Nice-to-have) Prior work in construction-tech, gov-tech or other regulated domains
- (Nice-to-have) Certification: AWS Solutions Architect or DevOps Pro
- (Nice-to-have) Experience introducing chaos engineering or game-days
- (Nice-to-have) Public track record (blog posts, OSS) advancing the SRE discipline
- (Nice-to-have) Leadership in defining hiring/on-call processes at a high-growth startup