NVIDIA

Principal Staff Site Reliability Engineer

NVIDIA

full-time

Posted on:

Origin:  • 🇺🇸 United States • California

Visit company website
AI Apply
Manual Apply

Salary

💰 $248,000 - $391,000 per year

Job Level

Lead

Tech Stack

CloudDistributed SystemsDNSGoLinuxMicroservicesPythonTerraform

About the role

  • Lead initiatives to transform IT Compute Core Team, architecture to build new service offerings across On-Prem and Cloud
  • You will design, scale, and deploy core infrastructure services including DNS, NTP/PTP, DHCP, and LDAP
  • This includes building for performance and reliability at global scale, covering automation, monitoring, high availability, capacity planning, and lifecycle management
  • Define and implement metrics to measure the efficiency of services and drive efficiency with software and hardware optimizations (SR-IOV/ DPU)
  • Experience with Technologies like eBPF and XDP for Observability & DDoS mitigation
  • Collect and review system data for capacity and planning purposes, analyze capacity data and develop plans for appropriate level enterprise-wide systems, and coordinate with management personnel in implementing changes
  • Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, monitoring
  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs

Requirements

  • Bachelor's degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience
  • 15+ years of proven experience in compute platform engineering with a focus on automation
  • Experience in designing and deploying Containerization architectures and Distributed Systems Infrastructure
  • Proven experience evaluating existing application architectures and identify opportunities for containerization to improve scalability, reliability, and efficiency
  • Strong analytical skills with the ability to define and track key performance metrics
  • Experience in developing tools for data analysis and performance profiling, Development with Terraform, Config Management tools
  • Proficiency in programming languages such as Go and/or Python
  • Linux OS Proficiency with Kernel Internals
  • Experience with running large environments consisting of BareMetal Build Infrastructure
  • Understanding of Network Protocols and Architectures (VLAN/VxLAN/SDN/BGP/Anycast)
  • Hands-on experience with containers and its implementation
  • Deploying and Managing Services like DNS , LDAP at scale
  • Solid understanding of microservices architecture, infrastructure as code (IaC) and configuration management tools