Relevance AI

Senior Site Reliability Engineer

full-time

Location Type: Hybrid

Location: San Francisco • California • 🇺🇸 United States

Job Level

Senior

Tech Stack

AWS, EC2, Grafana, Kubernetes, Microservices, Prometheus, SDLC, Terraform

About the role

  • Own the SRE function, establishing best practices, tooling, and culture
  • Tackle reliability challenges unique to multi-agent orchestration at enterprise scale
  • Guarantee >99.9% uptime of production systems, ensuring reliability at global scale
  • Architect and automate AWS infrastructure with Terraform and CI/CD pipelines
  • Design observability systems across microservices, APIs, and vector infrastructure (metrics, tracing, logging)
  • Drive down incidents and MTTR through runbooks, alerting, and incident response excellence
  • Help scale infra to support hundreds of thousands of agents and billions of API calls
  • Partner with engineering teams to embed SRE principles into the SDLC and shape org-wide reliability strategy
  • Act as a founding voice in our SF office, influencing product direction and engineering culture

Requirements

  • 5+ years in SRE/DevOps/Infrastructure roles, with experience in enterprise SaaS environments.
  • Deep AWS expertise (EC2, ECS/EKS, Lambda, RDS, VPC, IAM).
  • Proven track record with Infrastructure as Code (Terraform, Kubernetes/EKS, CDK, or CloudFormation).
  • Hands-on with observability stacks (CloudWatch, Grafana, Prometheus, Datadog).
  • Incident management experience in production SaaS systems, including on-call, postmortems, and reliability improvements.
  • Bonus: Prior exposure to AI/ML platforms, data-heavy systems, or multi-agent workloads.

Benefits

  • Hybrid work model (3 days in office)
