Senior Site Reliability Engineer

GreenLite

full-time

Posted on: 9/11/2025

Origin: 🇺🇸 United States • New York

Senior

AWS, Azure, Go, Google Cloud Platform, Grafana, Kubernetes, Postgres, Prometheus, Python, Rust, Terraform

About the role

Establish SRE patterns, tooling and culture to ensure systems are fast, observable and resilient
Design and harden production infrastructure (AWS ECS/Fargate via AWS Copilot, migrating to Terraform; RDS/Postgres, S3, EventBridge, Bedrock)
Lead reliability engineering: define SLOs/SLAs, error-budget policies, capacity planning and load testing
Own CI/CD: advance the GitHub Actions pipeline, introduce progressive delivery and automated rollbacks
Instrument & observe: deploy metrics, tracing and logging (Datadog) and run on-call with a focus on MTTR and blameless reviews
Security & compliance: automate patching, secrets management & rotation, enforce least-privilege IAM and SOC 2 controls
Coach & collaborate: mentor engineers, work with ML and product squads, influence architecture
Continuously improve: identify systemic bottlenecks, build tooling to eliminate toil and scale the platform
Drive migrations and disaster recovery planning (AWS Control Tower migration, RDS to Aurora migration, Terraform adoption)

Requirements

6+ years building and operating production systems in AWS, GCP or Azure (AWS preferred)
Demonstrated ownership of SLOs, incident response and post-incident analysis
Expert in IaC (Terraform, CDK, Pulumi)
Experience with container orchestration (ECS, EKS or K8s)
Proficient in at least one modern language (Python, Rust, Go), with strong Bash skills
Deep familiarity with observability stacks (Datadog, Grafana, Prometheus, OTEL)
Track record of raising the bar for security, compliance and cost optimisation
(Nice-to-have) Experience with infrastructure for ML workflows (model training, feature stores)
(Nice-to-have) Prior work in construction-tech, gov-tech or other regulated domains
(Nice-to-have) Certification: AWS Solutions Architect or DevOps Pro
(Nice-to-have) Experience introducing chaos engineering or game-days
(Nice-to-have) Public track record (blog posts, OSS) advancing the SRE discipline
(Nice-to-have) Leadership in defining hiring/on-call processes at a high-growth startup