Tech Stack
AWS, EC2, Grafana, Kubernetes, Microservices, Prometheus, SDLC, Terraform
About the role
- Own the SRE function, establishing best practices, tooling, and culture
- Tackle reliability challenges unique to multi-agent orchestration at enterprise scale
- Guarantee >99.9% uptime of production systems, ensuring reliability at global scale
- Architect and automate AWS infrastructure with Terraform and CI/CD pipelines
- Design observability systems across microservices, APIs, and vector infrastructure (metrics, tracing, logging)
- Drive down incidents and MTTR through runbooks, alerting, and incident response excellence
- Help scale infrastructure to support hundreds of thousands of agents and billions of API calls
- Partner with engineering teams to embed SRE principles into the SDLC and shape org-wide reliability strategy
- Act as a founding voice in our SF office, influencing product direction and engineering culture
Requirements
- 5+ years in SRE/DevOps/Infrastructure roles, with experience in enterprise SaaS environments.
- Deep AWS expertise (EC2, ECS/EKS, Lambda, RDS, VPC, IAM).
- Proven track record with Infrastructure as Code (Terraform, Kubernetes/EKS, CDK, or CloudFormation).
- Hands-on with observability stacks (CloudWatch, Grafana, Prometheus, Datadog).
- Incident management experience in production SaaS systems, including on-call, postmortems, and reliability improvements.
- **Bonus**: Prior exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads.
- Hybrid work model (3 days in office)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
SRE, DevOps, Infrastructure, AWS, Terraform, CI/CD, Kubernetes, Infrastructure as Code, observability, incident management
Soft skills
leadership, communication, collaboration, influence, problem-solving