Senior Site Reliability Engineer, AI Infrastructure

NVIDIA

full-time

Posted on: 8/27/2025

Origin: 🇺🇸 United States • California

💰 $184,000 - $356,500 per year

Senior

AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Linux, Perl, Prometheus, Python, PyTorch, Ray, Ruby, TensorFlow, Terraform

About the role

Develop and maintain large-scale systems supporting critical use cases for AI Infrastructure across global public and private clouds.
Implement SRE fundamentals including incident management, monitoring, and performance optimization; design automation tools to reduce operational overhead.
Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution.
Establish frameworks for operational maturity, lead incident response protocols, and conduct blameless postmortems.
Work with engineering teams to deliver solutions, mentor peers, uphold code/infrastructure standards, and contribute to hiring.

Degree in Computer Science or a related field, or equivalent experience, with 8+ years in Software Development, SRE, or Production Engineering.
Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, OCI, Azure, GCP).
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, along with experience using Infrastructure as Code tools (e.g., Terraform CDK).
Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
Strong communication skills and commitment to fostering diversity and continuous improvement.