Staff Site Reliability Engineer

Twelve Labs

The Staff Site Reliability Engineer is responsible for production reliability and infrastructure for multimodal AI models at Twelve Labs, and collaborates with product teams to ensure system health and performance.

Posted 5/8/2026
Full-time
San Francisco • California • 🇺🇸 United States
Lead
💰 $220,000 - $250,000 per year

Tech Stack

Tools & technologies

Ansible, AWS, Cloud, Grafana, Kubernetes, Prometheus, Terraform

About the role

Key responsibilities & impact

Own production reliability end to end — from deployment through monitoring, incident response, and postmortem-driven improvement.
Partner with the product engineering teams to ensure their services are reliable, observable, and operable by design.
Build and maintain observability systems (metrics, logging, tracing, alerting) that give the team clear signal on system health and performance.
Design and operate cloud infrastructure supporting AI/ML workloads.
Drive incident response — detect, diagnose, mitigate, and prevent production issues. Build the runbooks, automation, and guardrails that reduce mean time to recovery.
Identify and eliminate toil through automation, self-healing systems, and better tooling.

Requirements

What you’ll need

7+ years of experience operating production infrastructure systems, not just building them.
Strong hands-on experience with AWS and Kubernetes in production environments.
Solid fundamentals in OS internals, networking, storage, and compute — the kind that help you debug a problem at 3am without documentation.
Deep practical experience with observability (Prometheus/Grafana/Loki or equivalent), Infrastructure as Code (Terraform, Ansible), and CI/CD.
Track record of owning services end to end — deployment, monitoring, incident response, and postmortem follow-through.

Benefits

Comp & perks

An open and inclusive culture and work environment
Work closely with a collaborative, mission-driven team on cutting-edge AI technology
Full health, dental, and vision benefits
Extremely flexible PTO and parental leave policy. Office closed the week of Christmas and New Year's
Monthly wellness stipend
Annual Learning & Development stipend to invest in your growth
Global offices in San Francisco and Seoul, and coworking office memberships for remote team members
Visa support where applicable
Transportation stipend
Daily lunch & dinner provided

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

production reliability, incident response, observability, cloud infrastructure, AI/ML workloads, automation, Infrastructure as Code, CI/CD, OS internals, networking

Soft Skills

problem-solving, collaboration, communication, ownership, proactive improvement