
Staff Software Engineer – Reliability
Temporal Technologies
full-time
Location Type: Remote
Location: United States
Salary
💰 $212,000 - $286,200 per year
About the role
- Own reliability outcomes for Temporal Cloud end to end, partnering across engineering, infrastructure, and product to drive measurable improvements.
- Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths.
- Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios.
- Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk.
- Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments.
- Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints.
- Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit.
- Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time.
- Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability.
- Mentor other engineers and raise the bar on reliability engineering practices across teams.
Requirements
- Strong computer science fundamentals, especially in distributed systems, concurrency, and performance.
- Demonstrated ability to design and build complex systems that operate reliably under high load and partial failure.
- Experience driving reliability improvements across multiple services, not just within a single codebase.
- Hands-on experience with at least one of: gamedays, chaos testing, load testing, or building reliability scorecards.
- Strong judgment in ambiguous situations, including the ability to prioritize reliability work based on risk and impact.
- Excellent communication skills, including the ability to align multiple stakeholders on reliability goals, plans, and tradeoffs.
- A collaborative mindset and a track record of mentoring and leveling up engineering practices.
Benefits
- Unlimited PTO, 12 Holidays + 2 Floating Holidays
- 100% Premiums Coverage for Medical, Dental, and Vision
- AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
- Empower 401(k) Plan
- Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more!
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
reliability engineering, distributed systems, concurrency, performance testing, load testing, chaos testing, alerting thresholds, operational readiness criteria, reliability scorecards, incident response
Soft Skills
excellent communication, collaborative mindset, mentoring, strong judgment, prioritization, cross-team coordination, post-incident learning, decision documentation, measurable improvements, ambiguity management