Senior Site Reliability Engineer – Observability, Telemetry Platform

NVIDIA

Site Reliability Engineer designing and supporting observability and telemetry platforms for GPU cloud services at NVIDIA. Focused on operational reliability, performance, and incident response.

Posted 5/8/2026 • Full-time • Remote • California • 🇺🇸 United States • Senior • 💰 $176,000 - $333,500 per year

Tech Stack

Tools & technologies
Cloud, Distributed Systems, Go, Linux, Perl, Python, Ruby

About the role

Key responsibilities & impact
  • Design, implement and support the operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems

Requirements

What you’ll need
  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
  • 8+ years of experience with infrastructure automation, distributed systems design, and designing and developing tools for running large-scale private or public cloud systems in production
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in one or more of the following: Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits

Comp & perks
  • equity
  • benefits

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Infrastructure automation, distributed systems design, Python, Go, Perl, Ruby, Linux, Networking, Containers, observability platforms
Soft Skills
system design consulting, incident response, blameless postmortems, performance at scale, real time monitoring, logging, alerting, capacity management, support production systems, improve reliability
Certifications
BS degree in Computer Science, related technical field