Temporal Technologies

Staff Software Engineer – Reliability

full-time

Location Type: Remote

Location: United States

Salary

💰 $212,000 - $286,200 per year

About the role

  • Own reliability outcomes for Temporal Cloud operations end to end, partnering across engineering, infrastructure, and product to drive measurable improvements.
  • Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths.
  • Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios.
  • Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk.
  • Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments.
  • Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints.
  • Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit.
  • Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time.
  • Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability.
  • Mentor other engineers and raise the bar on reliability engineering practices across teams.

Requirements

  • Strong computer science fundamentals, especially in distributed systems, concurrency, and performance.
  • Demonstrated ability to design and build complex systems that operate reliably under high load and partial failure.
  • Experience driving reliability improvements across multiple services, not just within a single codebase.
  • Hands-on experience with at least one of: gamedays, chaos testing, load testing, or building reliability scorecards.
  • Strong judgment in ambiguous situations, including the ability to prioritize reliability work based on risk and impact.
  • Excellent communication skills, including the ability to align multiple stakeholders on reliability goals, plans, and tradeoffs.
  • A collaborative mindset and a track record of mentoring and leveling up engineering practices.

Benefits

  • Unlimited PTO, 12 Holidays + 2 Floating Holidays
  • 100% Premiums Coverage for Medical, Dental, and Vision
  • AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
  • Empower 401K Plan
  • Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more!

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

reliability engineering, distributed systems, concurrency, performance testing, load testing, chaos testing, alerting thresholds, operational readiness criteria, reliability scorecards, incident response

Soft Skills

excellent communication, collaborative mindset, mentoring, strong judgment, prioritization, cross-team coordination, post-incident learning, decision documentation, measurable improvements, ambiguity management