Twelve Labs

Staff Site Reliability Engineer

The Staff Site Reliability Engineer is responsible for production reliability and infrastructure for multimodal AI models at Twelve Labs, and collaborates with product teams to ensure system health and performance.

Posted 5/8/2026 • full-time • San Francisco • California • 🇺🇸 United States • Lead • 💰 $220,000 - $250,000 per year

Tech Stack

Tools & technologies
Ansible, AWS, Cloud, Grafana, Kubernetes, Prometheus, Terraform

About the role

Key responsibilities & impact
  • Own production reliability end to end — from deployment through monitoring, incident response, and postmortem-driven improvement.
  • Partner with the product engineering teams to ensure their services are reliable, observable, and operable by design.
  • Build and maintain observability systems (metrics, logging, tracing, alerting) that give the team clear signal on system health and performance.
  • Design and operate cloud infrastructure supporting AI/ML workloads.
  • Drive incident response — detect, diagnose, mitigate, and prevent production issues. Build the runbooks, automation, and guardrails that reduce mean time to recovery.
  • Identify and eliminate toil through automation, self-healing systems, and better tooling.

Requirements

What you’ll need
  • 7+ years of experience operating production infrastructure systems, not just building them.
  • Strong hands-on experience with AWS and Kubernetes in production environments.
  • Solid fundamentals in OS internals, networking, storage, and compute — the kind that help you debug a problem at 3am without documentation.
  • Deep practical experience with observability (Prometheus/Grafana/Loki or equivalent), Infrastructure as Code (Terraform, Ansible), and CI/CD.
  • Track record of owning services end to end — deployment, monitoring, incident response, and postmortem follow-through.

Benefits

Comp & perks
  • An open and inclusive culture and work environment
  • Work closely with a collaborative, mission-driven team on cutting-edge AI technology
  • Full health, dental, and vision benefits
  • Extremely flexible PTO and parental leave policy. Office closed the week of Christmas and New Year's.
  • Monthly wellness stipend
  • Annual Learning & Development stipend to invest in your growth
  • Global offices in San Francisco and Seoul, and coworking office memberships for remote team members
  • Visa support where applicable
  • Transportation stipend
  • Daily lunch & dinner provided

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
production reliability, incident response, observability, cloud infrastructure, AI/ML workloads, automation, Infrastructure as Code, CI/CD, OS internals, networking
Soft Skills
problem-solving, collaboration, communication, ownership, proactive improvement