Senior Operations Administrator, Service Reliability

NVIDIA

full-time

Posted on: 9/24/2025

Origin: 🇺🇸 United States, California


💰 $124,000 - $224,250 per year

Senior

Ansible, Cloud, DNS, Open Source, Python, Shell Scripting

About the role

Design, develop and implement a global, dynamic Service Reliability Operations Center to support NVIDIA NGC Cloud products and services.
Partner with Site Reliability Engineering, Security Operations Center, DevOps teams, and other partners to drive near 100% availability.
Act as front line to decrease frequency and duration of incidents; discover incidents and initiate incident management procedures.
Develop monitors, alarms, and alerts; use alerts and alarms to help prevent issues.
Work with developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration tasks, network administration tasks, and security incident monitoring.
Work with developers to learn services and translate that understanding into runbooks; update and evolve runbooks as features are added.
Bring in subject matter authorities or service owners as needed to resolve issues and provide feedback to improve service.
Provide 24/7 follow-the-sun support across continents; report directly to a manager in the United States.
May perform other tasks to provide extraordinary service levels for customers.

Requirements

5+ years of experience administering open system servers in a Production environment.
3+ years of experience working in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
B.S. in relevant disciplines or equivalent experience.
Expertise using monitoring tools and problem ticketing systems.
Strong problem-solving, analytical, and troubleshooting abilities.
Strong server administration experience.
Experience with shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables.
RHCE or equivalent level of knowledge.
Experience scripting in Python preferred, but not required.
Prior experience running virtual machines under open source or commercial hypervisors.
Experience operating services running on public or private clouds.
Knowledge and understanding of application containers and container orchestration systems.
Basic understanding of Git.
Experience performing system administration tasks using Ansible.
Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
Demonstrated ability to master and maintain complicated environments.
Some CIS shifts require working either a Saturday or a Sunday each week; hours may include an early or late start (schedule of four 10-hour days per week).