Tech Stack
AWS, Cloud, Docker, Grafana, JavaScript, Kafka, Kubernetes, Microservices, Node.js, Prometheus, Python, React, Terraform, TypeScript
About the role
- Own ephemeral & preview environments: Partner with INFRA to design, build, and scale automated, short-lived environments (per PR/feature) including safe data strategies (seed/snapshot/masking), TTL policies, cost controls, and consistent app/infra templates.
- CI/CD at scale: Streamline and standardize pipelines (e.g., CircleCI → Argo CD) for services and jobs; speed up builds/tests with caching, parallelism, and flake reduction; maintain artifact/versioning strategies across our monorepos.
- Release gating & progressive delivery: Partner with QA to implement quality gates (tests, coverage deltas, policy checks).
- Observability & reliability: Partner with INFRA to level up metrics, logs, and traces (Datadog/OpenTelemetry); define health checks and deployment KPIs; contribute to on-call readiness, incident response, and post-incident improvements.
- Vendor integration: Assist Product Engineering with building robust integrations for external services (e.g., Confluent Cloud/Kafka) with secure networking, credentials, and monitoring; document best practices as reusable templates.
- Developer experience: Contribute to internal tooling and documentation that make the “right way” the easy way: CLIs, scaffolds, templates, and playbooks for environment creation, deploys, and debugging.
- Measure & iterate: Track DORA metrics (lead time, deploy frequency, change failure rate, MTTR); set targets and deliver continuous improvements.
Requirements
- 6–10+ years in DevOps/Platform/SRE roles building and operating production systems at scale.
- Expertise with Kubernetes (EKS) and AWS (IAM, VPC, ECR, SSM/Secrets Manager, CloudWatch, S3, SQS, Lambda, RDS/Aurora).
- Strong IaC chops (Terraform preferred) and GitOps workflows (Argo CD or similar).
- Proven track record building ephemeral/preview environments and standardizing app/infra templates across many services.
- CI/CD mastery (CircleCI or similar), including caching/parallelism, artifact management, test reliability, and pipeline observability.
- Experience with release strategies (canary/blue-green, automated rollbacks) and quality gates.
- Observability fundamentals (Datadog/Prometheus/Grafana, OpenTelemetry); ability to define SLIs/SLOs and wire them to delivery decisions.
- Excellent cross-team communicator who can translate platform constraints into developer-friendly solutions and documentation.
- Comfort in an AI-augmented engineering culture; enthusiasm for automation and building tools that eliminate toil.