
Senior Site Reliability Engineer, AI Factory
NVIDIA
full-time
Location Type: Remote
Location: California • United States
Salary
💰 $176,000 - $333,500 per year
About the role
- Running commissioning and provisioning for GPU systems
- Managing the firmware versions of equipment and components, and communicating the supported versions across the organization
- Maintaining tight SLOs for efficiency, performance, and availability throughout Day-2 operations
- Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users consistently attain peak performance
- Triaging hardware break-fix issues and making continuous improvements using open-source break-fix solutions
- Collaborating with programming and technical divisions to define and implement repeatable procedures
- Developing and implementing operations strategy and processes, maintaining consistency with SLAs across critically important infrastructure
- Developing and applying procedures and quality controls that minimize downtime and support continuous uptime
- Feeding requirements to software and hardware teams
- Creating documentation that the ecosystem can use to run its own AI data centers
Requirements
- BS or MS degree in Computer Engineering/Science, or related field (or equivalent experience) with 10+ overall years of meaningful work experience
- Experience managing GPU Fleets
- 10+ years of expertise in improving data center operations or critical infrastructure
- Expertise in BMS & Power management
- Background in working with Provisioning, Commissioning, and Config Management solutions
- Experience working with Packer and developing QCOW2 images
- Background in coordinating with remote hands
- Experience working with Datacenter Inventory Management Systems like Netbox, Nautilus, or others
- Proven track record of working with multiple teams to achieve operational excellence for an organization
- Experience driving reliability with robust processes, rapid field response, and recovery
Benefits
- equity
- benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
GPU systems, firmware management, SLOs, performance monitoring, break-fix solutions, operations strategy, quality controls, Provisioning, Commissioning, Config Management
Soft Skills
collaboration, communication, problem-solving, operational excellence, team coordination, process improvement, documentation
Certifications
BS degree in Computer Engineering, MS degree in Computer Science