NVIDIA

Senior Site Reliability Engineer, AI Factory

full-time

Location Type: Remote

Location: California, United States

Salary

💰 $176,000 - $333,500 per year

About the role

  • Running commissioning and provisioning for GPU systems
  • Managing the firmware versions of equipment and components, and communicating the supported versions across the organization
  • Through Day-2 operations, keeping tight SLOs around efficiency, performance, and availability
  • Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users consistently attain peak performance
  • Triaging hardware break-fix issues and making continuous improvements using open-source break-fix solutions
  • Collaborating with programming and technical divisions to define and implement repeatable procedures
  • Developing and implementing operations strategy and processes, maintaining consistency with SLAs across critically important infrastructure
  • Developing and applying minimal-downtime procedures and quality controls in pursuit of continuous uptime
  • Feeding requirements to software and hardware teams
  • Creating documentation that the ecosystem can use to run its own AI Data Centers

Requirements

  • BS or MS degree in Computer Engineering/Science, or related field (or equivalent experience) with 10+ overall years of meaningful work experience
  • Experience managing GPU Fleets
  • 10+ years of expertise in improving data center operations or critical infrastructure
  • Expertise in BMS & Power management
  • Background in working with Provisioning, Commissioning, and Config Management solutions
  • Experience working with Packer and developing QCOW2 images
  • Background in coordinating with remote hands
  • Experience working with Datacenter Inventory Management Systems like NetBox, Nautobot, or others
  • Proven track record of working with multiple teams to achieve operational excellence for an organization
  • Experience driving reliability with robust processes, rapid field response, and recovery

Benefits

  • equity
  • benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
GPU systems, firmware management, SLOs, performance monitoring, break-fix solutions, operations strategy, quality controls, Provisioning, Commissioning, Config Management
Soft Skills
collaboration, communication, problem-solving, operational excellence, team coordination, process improvement, documentation
Certifications
BS degree in Computer Engineering, MS degree in Computer Science