Staff Site Reliability Engineer
Domino Data Lab
Site Reliability Engineer developing AI-assisted reliability tooling for Domino Data Lab. Leading incident response and mentorship for engineering best practices in a growing startup environment.
Tech Stack
Tools & technologies
Cloud, Go, Kubernetes, Linux, Python
About the role
Key responsibilities & impact
- Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
- Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
- Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
- Guide the development of customer and user-facing observability tools within our products
- Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
- Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
- Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture
Requirements
What you’ll need
- Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
- A strong ability to perceive and close reliability gaps in technical products, tools and processes
- Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
- Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
- A history of improving reliability through engineering and automation, not just putting out fires manually
- Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
- Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
- Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams
Benefits
Comp & perks
- We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply
- We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success
- We believe in individuals who seek truth and speak the truth and can be their whole selves at work.
- We value everyone who believes that improving is always possible. At Domino, everything is a work in progress – we can do better at everything.
- We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and the company.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, platform engineering, software engineering, Kubernetes, Linux, cloud platforms, observability tooling, Python, Go, SLO/SLI frameworks
Soft Skills
mentoring, communication, influencing, problem-solving, operational readiness, post-incident learning, technical decision-making, perceiving reliability gaps, leading ambiguous work, judgment