Senior Site Reliability Engineer

NVIDIA

full-time

Posted on: 8/29/2025

Origin: 🇺🇸 United States • California

💰 $208,000 - $333,500 per year

Senior

Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Puppet, Python, Splunk, TCP/IP, Terraform

About the role

Support large-scale Kubernetes services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice sustainable incident response and blameless postmortems
Participate in on-call rotation to support production services

Requirements

BS in Computer Science or related technical field, or equivalent experience
12+ years of experience operating production services at scale
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale
Experience with infrastructure automation tools (Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools such as OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, and Datadog

Ways to stand out from the crowd

Operating GPU-accelerated clusters with KubeVirt in production
Applying generative-AI techniques to reduce operational toil
Automating incident response with Shoreline or StackStorm