Senior Site Reliability Engineer

NVIDIA

full-time

Posted on: 9/2/2025

Origin: 🇮🇳 India

Senior

Ansible, AWS, Azure, Chef, Cloud, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Puppet, Python, Splunk, TCP/IP, Terraform

About the role

NVIDIA DGX Cloud delivers a fully managed AI platform on major cloud providers
Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
Maintain services once live by measuring and monitoring availability, latency and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through automation and evolve systems to improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents and conduct blameless postmortems
Participate in the on-call rotation to support production services

Requirements

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.)
Experience operating GPU workloads and GPU-accelerated clusters (KubeVirt experience is a plus)