Principal Site Reliability Engineer – Observability, Telemetry Platform
NVIDIA
Principal Site Reliability Engineer at NVIDIA focusing on observability and telemetry platforms. Responsible for designing, implementing, and supporting operational reliability of large-scale systems.
Posted 5/19/2026 • full-time • Santa Clara • California • 🇺🇸 United States • Lead • 💰 $248,000 - $396,750 per year
Tech Stack
Tools & technologies
Cloud, Distributed Systems, Go, Linux, Perl, Python, Ruby
About the role
Key responsibilities & impact
- Design, implement and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
- Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
- Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Practice sustainable incident response and blameless postmortems
- Be part of an on-call rotation to support production systems
Requirements
What you’ll need
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
- 15+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
- 8+ years of experience delivering foundational infrastructure and observability platforms
- Experience in one or more of the following: Python, Go, Perl or Ruby
- In-depth knowledge of Linux, networking and containers
Benefits
Comp & perks
- equity
- benefits