Staff Site Reliability Engineer

Domino Data Lab

Staff Site Reliability Engineer at Domino Data Lab, working on AI-assisted reliability tooling, leading incident response, and improving system observability for critical services.

Posted 6/16/2026 • Full-time • Remote • California • 🇺🇸 United States • Lead • 💰 $200,000 - $230,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Site Reliability Engineering, platform engineering, software engineering, Kubernetes, Linux, cloud platforms, observability tooling, Python, Go, AI/LLM tooling

Soft Skills

mentoring, communication, influencing, problem-solving, operational readiness, post-incident learning, technical decision-making, perceiving reliability gaps, leading ambiguous work, closing reliability gaps

Tools & Technologies

internal AI-assisted reliability tooling, customer-facing observability tools, SLO/SLI frameworks, cloud operations practices, incident response workflows, support tooling, developer tooling, SaaS platform operations, retrieval workflows, ticket analysis systems

Industry Keywords

reliability engineering, operational workflows, automation, incident response, customer deployments, technical products, engineering practices, production problems, measurable standards, documentation

Tech Stack

Tools & technologies

Cloud, Go, Kubernetes, Linux, Python

About the role

Key responsibilities & impact

Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
Improve observability coverage and signal quality for our most critical customer-facing systems, so engineers have better data to work with throughout the development and support lifecycle
Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
Guide the development of customer and user-facing observability tools within our products
Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture

Requirements

What you’ll need

Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
A strong ability to perceive and close reliability gaps in technical products, tools, and processes
Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
A history of improving reliability through engineering and automation, not just putting out fires manually
Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams

Benefits

Comp & perks

equity
company bonus or sales commissions/bonuses
401(k) plan
medical, dental, and vision benefits
wellness stipends