NVIDIA

Senior Site Reliability Engineer, AI Infrastructure

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $184,000 - $356,500 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Linux, Perl, Prometheus, Python, PyTorch, Ray, Ruby, TensorFlow, Terraform

About the role

  • Develop and maintain large-scale systems supporting critical use cases for AI Infrastructure across global public and private clouds.
  • Implement SRE fundamentals including incident management, monitoring, and performance optimization; design automation tools to reduce operational overhead.
  • Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution.
  • Establish frameworks for operational maturity, lead incident response protocols, and conduct blameless postmortems.
  • Work with engineering teams to deliver solutions, mentor peers, uphold code/infrastructure standards, and contribute to hiring.

Requirements

  • Degree in Computer Science or a related field, or equivalent experience, with 8+ years in software development, SRE, or production engineering.
  • Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
  • Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, OCI, Azure, GCP).
  • Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, as well as Infrastructure as Code tools (e.g., Terraform CDK).
  • Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
  • Strong communication skills and commitment to fostering diversity and continuous improvement.