
AI Infrastructure Engineer – HPC
Cisco
full-time
Location Type: Hybrid
Location: RTP • North Carolina • Texas • United States
Salary
💰 $138,000 - $176,000 per year
About the role
- Technical, hands-on role building and supporting NVIDIA- and Cisco UCS-based artificial intelligence platforms.
- Plan, build, and install/upgrade new systems that support NVIDIA DGX and Cisco UCS hardware and software.
- Automate configuration management, software updates, and maintenance and monitoring of GPU system availability using modern DevOps tools (Ansible, GitLab, etc.).
- Evaluate system performance based on industry-relevant benchmarks.
- Identify and optimize performance bottlenecks to drive system and workflow efficiency.
- Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
- Collaborate closely with internal Cisco Business Units, application teams, and cross-functional technical domains.
- Create written technical designs, documents, and presentations.
- Stay up to date with AI industry advancements and cutting-edge technologies.
- Accelerate the delivery of AI capabilities across our portfolio.
- Design new monitoring and alerting tools that help discover failures or issues before our customers do.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Requirements
- 7+ years of experience deploying and administering HPC clusters
- Proficient in general-purpose programming languages (Python, GoLang, Bash and/or C/C++) and development platforms and technologies.
- Familiar with GPU resource schedulers such as Slurm (preferred), Kubernetes, and/or RunAI
- Master's degree or equivalent work experience (preferred)
- Proficient in Hybrid Cloud, Virtualization, and Container technologies
- Experience with provisioning tools like Base Command Manager, Warewulf, Satellite, and/or Ironic
- Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira), Git, and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
- Experience with automation tools like Ansible, SaltStack, Puppet and/or Chef
- Deep understanding of operating systems, computer networks, and high-performance applications.
- Established record of leading technical initiatives and delivering results, with a commitment to fostering a supportive work environment.
Benefits
- medical, dental and vision insurance
- a 401(k) plan with a Cisco matching contribution
- paid parental leave
- short- and long-term disability coverage
- basic life insurance
- 10 paid holidays per full calendar year
- 1 floating holiday for non-exempt employees
- 1 paid day off for employee’s birthday
- paid year-end holiday shutdown
- 4 paid days off for personal wellness
- paid vacation time
- flexible vacation time off program
- 80 hours of sick time off
- optional 10 paid days per full calendar year to volunteer
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, GoLang, Bash, C/C++, GPU resource scheduling managers, Slurm, Hybrid Cloud, Virtualization, Container technologies, High-performance applications
Soft Skills
collaboration, technical initiative leadership, results delivery, supportive work environment
Certifications
Master's degree