Tech Stack
Tools & technologies
Ansible, iOS, Linux, Python, SaltStack
About the role
Key responsibilities & impact
- Build AI/HPC infrastructure for new and existing customers.
- Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
- Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.
Requirements
What you’ll need
- BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields.
- 5+ years of professional experience in networking fundamentals, in the Ethernet or InfiniBand world.
- Hands-on experience with network switch/router platforms such as Cumulus Linux, SONiC, IOS, Junos OS, and EOS.
- Solid working knowledge of Ethernet/InfiniBand/RDMA core principles.
- Proficiency in end-to-end IB/Ethernet cluster deployment, adapter configuration, and firmware maintenance, plus the ability to run professional performance benchmarks with mainstream RDMA testing tools.
- Ability to independently diagnose and troubleshoot typical IB/Ethernet network anomalies, including link flapping, connection failures, and bandwidth and latency jitter.
- Command of practical RDMA network optimization strategies such as QP tuning, MTU configuration, and congestion control optimization.
- Hands-on experience with RDMA-accelerated workloads, including distributed storage and high-performance computing clusters.
- Extensive experience delivering automated network provisioning solutions using tools like Ansible, Salt, and Python.
- Ability to develop CI/CD pipelines for network operations.
- Strong written, verbal, and listening skills in English are essential.
Benefits
Comp & perks
- NVIDIA pioneered accelerated computing.
- Our AI infrastructure powers global intelligence, transforming every industry.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AI infrastructure, HPC infrastructure, networking fundamentals, Ethernet, InfiniBand, RDMA, performance benchmarking, network optimization, CI/CD pipelines, automated network provisioning
Soft Skills
problem-solving, communication, collaboration, feedback provision, independent diagnosis, troubleshooting, performance improvement, monitoring, documentation, listening
