Salary
💰 $136,000 - $264,500 per year
Tech Stack
Ansible, Cloud, Docker, Kubernetes, Linux, Puppet, Python, SaltStack
About the role
- Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage.
- Develop and improve our ecosystem around GPU-accelerated computing, including scalable automation solutions
- Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
- Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet users' evolving needs
- Support our researchers in running their workloads, including performance analysis and optimization
- Conduct root cause analysis and suggest corrective action
- Proactively identify and resolve potential issues before they affect users
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- 5+ years of experience designing and operating large-scale compute infrastructure
- Experience with advanced AI/HPC job schedulers such as Slurm, K8s, PBS, RTDA, or LSF
- Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt
- In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
- Proficiency in Python programming and bash scripting
- Applied experience with AI/HPC workflows that use MPI
- Experience analyzing and tuning performance for a variety of AI/HPC workloads.
- Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.
- equity
- benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
GPU-accelerated computing, AI, ML, HPC systems, Python, bash scripting, MPI, performance analysis, cluster configuration management, automation solutions
Soft skills
leadership, strategic guidance, customer relationship management, cross-team collaboration, problem-solving, proactive issue resolution, communication, performance optimization, root cause analysis, continual learning