Salary
💰 $124,000 - $224,250 per year
Tech Stack
Ansible, Cloud, DNS, Open Source, Python, Shell Scripting
About the role
- Design, develop and implement a global, dynamic Service Reliability Operations Center to support NVIDIA NGC Cloud products and services.
- Partner with Site Reliability Engineering, Security Operations Center, DevOps teams, and other partners to drive near 100% availability.
- Act as the front line to decrease the frequency and duration of incidents; discover incidents and initiate incident management procedures.
- Develop monitors, alarms, and alerts; use alerts and alarms to help prevent issues.
- Work with the developer community to develop and implement predictive support or diagnostic routines.
- Perform systems administration tasks, network administration tasks, and security incident monitoring.
- Work with developers to learn services and translate that understanding into runbooks; update and evolve runbooks as features are added.
- Bring in subject matter authorities or service owners as needed to resolve issues and provide feedback to improve service.
- Provide 24/7 follow-the-sun support across continents; report directly to a manager in the United States.
- May perform other tasks to provide extraordinary service levels for customers.
Requirements
- 5+ years of experience administering open system servers in a Production environment.
- 3+ years of experience working in demanding Internet, Cloud, or Telecommunications environments in a Systems Administration, DevOps, SRE, or NOC role.
- B.S. in relevant disciplines or equivalent experience.
- Expertise using monitoring tools and problem ticketing systems.
- Strong problem-solving, analytical, and troubleshooting abilities.
- Strong server administration experience.
- Working knowledge of shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc.
- RHCE or equivalent level of knowledge.
- Experience scripting in Python preferred, but not required.
- Prior experience running virtual machines under open source or commercial hypervisors.
- Experience operating services running on public or private clouds.
- Knowledge and understanding of application containers and container orchestration systems.
- Basic understanding of Git.
- Experience performing system administration tasks using Ansible.
- Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
- Demonstrated ability to master and maintain complicated environments.
- Some CIS shifts require working either a Saturday or a Sunday each week; hours may include an early or late start (schedule of 10 hours per day, 4 days per week).