NVIDIA

Senior Operations Administrator, Service Reliability

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $124,000 - $224,250 per year

Job Level

Senior

Tech Stack

Ansible, Cloud, DNS, Open Source, Python, Shell Scripting

About the role

  • Design, develop and implement a global, dynamic Service Reliability Operations Center to support NVIDIA NGC Cloud products and services.
  • Partner with Site Reliability Engineering, Security Operations Center, DevOps teams, and other partners to drive near 100% availability.
  • Act as the front line to decrease the frequency and duration of incidents; discover incidents and initiate incident management procedures.
  • Develop monitors, alarms, and alerts; use alerts and alarms to help prevent issues.
  • Work with the developer community to develop and implement predictive support or diagnostic routines.
  • Perform systems administration tasks, network administration tasks, and security incident monitoring.
  • Work with developers to learn services and translate that understanding into runbooks; update and evolve runbooks as features are added.
  • Bring in subject matter authorities or service owners as needed to resolve issues and provide feedback to improve service.
  • Provide 24/7 follow-the-sun support across continents; report directly to a manager in the United States.
  • May perform other tasks to provide extraordinary service levels for customers.

Requirements

  • 5+ years of experience administering open system servers in a Production environment.
  • 3+ years of experience working in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
  • B.S. in relevant disciplines or equivalent experience.
  • Expertise using monitoring tools and problem ticketing systems.
  • Strong problem-solving, analytical, and troubleshooting abilities.
  • Strong server administration experience.
  • Knowledge of shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc.
  • RHCE or equivalent level of knowledge.
  • Experience scripting in Python preferred, but not required.
  • Prior experience running virtual machines under open source or commercial hypervisors.
  • Experience operating services running on public or private clouds.
  • Knowledge and understanding of application containers and container orchestration systems.
  • Basic understanding of Git.
  • Experience performing system administration tasks using Ansible.
  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
  • Demonstrated ability to master and maintain complicated environments.
  • Some CIS shifts require working either a Saturday or a Sunday each week; hours may include an early or late start (schedule of 10 hours per day, 4 days per week).

Verisys

Senior Systems Administrator

Senior · full-time · Arizona, Colorado, Florida, Illinois, Kansas, Kentucky, North Carolina, Ohio, Oklahoma · 🇺🇸 United States
Posted: 35 days ago · Source: jobs.lever.co
Ansible, AWS, Cloud, DNS, EC2, Linux, Prometheus, Python, TCP/IP, Terraform

VetsEZ

Linux Administrator

Mid · Senior · full-time · 🇺🇸 United States
Posted: 29 days ago · Source: vetsez.breezy.hr
Ansible, AWS, Azure, Chef, Cloud, DNS, Firewalls, Linux, Oracle, Puppet, Python, Shell Scripting, +2 more

Fisher Investments

Senior Cloud Engineer

Senior · full-time · Florida · 🇺🇸 United States
Posted: 4 days ago · Source: jobs-fishercareers.icims.com
Ansible, Azure, Cloud, DNS, Firewalls, Kubernetes, Linux, Python, ServiceNow, SQL, Terraform

Fisher Investments

Senior Cloud Engineer

Senior · full-time · $130k–$180k / year · Washington · 🇺🇸 United States
Posted: 1 day ago · Source: jobs-fishercareers.icims.com
Ansible, Azure, Cloud, DNS, Firewalls, Kubernetes, Linux, Python, ServiceNow, SQL, Terraform

Protera

Network Operations Engineer

Mid · Senior · full-time · 🇺🇸 United States
Posted: 6 days ago · Source: apply.workable.com
Ansible, AWS, Cloud, DNS, Firewalls, Python, Switching, TCP/IP, Terraform