Domino Data Lab

Staff Site Reliability Engineer

Staff Site Reliability Engineer working on AI-assisted reliability tooling at Domino Data Lab. Leading incident response and enhancing system observability for critical services.

Posted 6/16/2026 • Full-time • Remote • California • 🇺🇸 United States • Lead • 💰 $200,000 - $230,000 per year

Tech Stack

Tools & technologies
Cloud • Go • Kubernetes • Linux • Python

About the role

Key responsibilities & impact
  • Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
  • Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
  • Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
  • Guide the development of customer and user-facing observability tools within our products
  • Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
  • Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
  • Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture

Requirements

What you’ll need
  • Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
  • A strong ability to perceive and close reliability gaps in technical products, tools, and processes
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
  • Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
  • A history of improving reliability through engineering and automation, not just putting out fires manually
  • Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
  • Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
  • Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams

Benefits

Comp & perks
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • Medical, dental, and vision benefits
  • Wellness stipends

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineering • platform engineering • software engineering • Kubernetes • Linux • cloud platforms • observability tooling • Python • Go • AI/LLM tooling
Soft Skills
mentoring • communication • influencing • problem-solving • operational readiness • post-incident learning • technical decision-making • perceiving reliability gaps • leading ambiguous work • closing reliability gaps