Cisco

AI Infrastructure Engineer – HPC

full-time

Location Type: Hybrid

Location: RTP; North Carolina; Texas; United States

Salary

💰 $138,000 - $176,000 per year

About the role

  • Hands-on technical role building and supporting NVIDIA- and Cisco UCS-based artificial intelligence platforms.
  • Plan, build, and install/upgrade new systems that support NVIDIA DGX and Cisco UCS hardware and software.
  • Automate configuration management, software updates, and maintenance and monitoring of GPU system availability using modern DevOps tools (Ansible, GitLab, etc.).
  • Evaluate system performance based on industry-relevant benchmarks.
  • Identify and optimize performance bottlenecks to drive system and workflow efficiency.
  • Administer Linux systems, ranging from powerful GPU-enabled servers to general-purpose compute systems.
  • Collaborate closely with internal Cisco Business Units, application teams, and cross-functional technical domains.
  • Create written technical designs, documents, and presentations.
  • Stay up to date with AI industry advancements and cutting-edge technologies.
  • Accelerate the delivery of AI capabilities across our portfolio.
  • Design new monitoring and alerting tools that help discover failures or issues before our customers do.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

Requirements

  • 7+ years of experience deploying and administering HPC clusters
  • Proficient in general-purpose programming languages (Python, GoLang, Bash and/or C/C++) and development platforms and technologies.
  • Familiar with GPU resource scheduling managers such as Slurm (preferred), Kubernetes, and/or RunAI
  • Master's degree or equivalent work experience (preferred)
  • Proficient in Hybrid Cloud, Virtualization, and Container technologies
  • Experience with provisioning tools like Base Command Manager, Warewulf, Satellite, and/or Ironic
  • Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira), Git, and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
  • Experience with automation tools like Ansible, SaltStack, Puppet and/or Chef
  • Deep understanding of operating systems, computer networks, and high-performance applications.
  • Established record of leading technical initiatives and delivering results, with a commitment to fostering a supportive work environment.

Benefits

  • medical, dental and vision insurance
  • a 401(k) plan with a Cisco matching contribution
  • paid parental leave
  • short- and long-term disability coverage
  • basic life insurance
  • 10 paid holidays per full calendar year
  • 1 floating holiday for non-exempt employees
  • 1 paid day off for employee’s birthday
  • paid year-end holiday shutdown
  • 4 paid days off for personal wellness
  • paid vacation time
  • flexible vacation time off program
  • 80 hours of sick time off
  • optional 10 paid days per full calendar year to volunteer