
Senior System Software Engineer, NCCL – Partner Enablement
NVIDIA
full-time
Location Type: Remote
Location: California • Texas • United States
Salary
💰 $152,000 - $218,500 per year
About the role
- Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
- Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
- Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
- Guide our customers and support teams on HPC best practices and standard methodologies for running applications on multi-node clusters
- Write documentation and deliver trainings and webinars on NCCL
- Engage with internal teams across time zones on networking, GPUs, storage, infrastructure, and support
Requirements
- B.S./M.S. in CS/CE (or equivalent experience) and 5+ years of relevant experience
- Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
- Experience working with the engineering or academic research community in HPC or AI
- Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
- Expertise in Linux fundamentals and at least one scripting language, preferably Python
- Familiarity with containers, cloud provisioning, and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
- Adaptability and passion to learn new areas and tools
- Flexibility to work and communicate effectively across different teams and time zones
Benefits
- equity
- benefits