NVIDIA

Senior Site Reliability Engineer

full-time

Origin: 🇮🇳 India

Job Level

Senior

Tech Stack

Ansible, AWS, Azure, Chef, Cloud, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Puppet, Python, Splunk, TCP/IP, Terraform

About the role

  • NVIDIA DGX Cloud delivers a fully managed AI platform on major cloud providers
  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
  • Maintain services once live by measuring and monitoring availability, latency, and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through automation and evolve systems to improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents, and conduct blameless postmortems
  • Participate in an on-call rotation to support production services

Requirements

  • BS in Computer Science or related technical field, or equivalent experience
  • 10+ years of experience operating production services
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
  • Proficiency in at least one high-level programming language (e.g., Python, Go)
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
  • Experience building and operating comprehensive observability stacks (OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.)
  • Experience operating GPU workloads and GPU-accelerated clusters (KubeVirt experience is a plus)