Tech Stack
AWS, Azure, Cloud, Google Cloud Platform, Kubernetes, MongoDB, Postgres, Python, Redis, Spark, Terraform
About the role
- Act as a trusted technical leader setting direction on architecture, reliability, scalability, and developer experience.
- Architect, build, and support scalable and highly reliable software systems that power Ada’s platform growth.
- Lead the design and implementation of resilient cloud infrastructure (multi-region, multi-cloud where appropriate) to ensure uptime, scalability, and operational safety.
- Continuously analyze and optimize infrastructure for reliability, performance, and cost—removing bottlenecks, modernizing tooling, and streamlining workflows.
- Support developer tools and processes (CI/CD pipelines, deployment frameworks, environment provisioning) to maximize engineering velocity.
- Troubleshoot and resolve complex infrastructure issues; participate in and elevate on-call operations and incident response.
- Implement advanced DevOps practices across infrastructure as code, deployments, monitoring, and platform abstractions.
- Establish and maintain reliability standards, define and enforce SLOs/SLAs, and ensure observability is built into all systems.
- Lead cross-cutting initiatives to improve uptime, resiliency, and incident response processes, driving systemic reliability improvements.
- Create and evolve platform abstractions, patterns, frameworks, and tooling to accelerate developer velocity and reduce operational toil.
- Coach and mentor senior and mid-level engineers, contribute to engineering excellence, and represent DevOps in executive and cross-functional forums.
- Outcomes: scalable, reliable, cost-effective infrastructure supporting rapid growth; measurable improvements in uptime, resiliency, and developer velocity; reduced operational toil.
Requirements
- 8+ years of experience in DevOps, Site Reliability Engineering (SRE), or platform teams, including at least 2 years operating at a Staff/Principal or equivalent senior technical leadership level.
- Recognized expertise in building and scaling cloud infrastructure (AWS/Azure/GCP), with proven experience designing multi-region, highly available systems.
- Deep technical knowledge of Kubernetes and container orchestration at scale (hundreds to thousands of nodes), including performance tuning, cost optimization, and failure mode analysis.
- Strong experience managing and scaling data infrastructure (e.g., MongoDB, PostgreSQL, Redis), with a focus on horizontal scaling, sharding, and performance optimization.
- Strong background in Infrastructure as Code (e.g., Terraform) and GitOps tooling (e.g., ArgoCD).
- Proficiency in Python, Bash, or equivalent scripting languages for automation.
- Experience creating and supporting cloud-based systems at scale (AWS/Azure/GCP), with a strong emphasis on Infrastructure as Code (IaC).
- Experience with MongoDB and horizontally scaling data stores (e.g., sharding).
- Experience leading incident response, root cause analysis, and systemic reliability improvements.
- A track record of technical leadership: driving cross-team initiatives, mentoring engineers, and shaping long-term infrastructure strategy.
- Excellent communication skills to translate technical complexity into business impact and influence cross-functional stakeholders.
- Nice to have: Experience with multi-cloud architecture and hybrid deployments.
- Nice to have: Familiarity with support tooling (PagerDuty, Datadog, Loft, Doppler) at organizational scale.