Senior Software Engineer – NVLink Rack Scale Stability and Reliability
NVIDIA
Senior Software Engineer joining NVIDIA's Fabric Networking team to enhance NVLink and NVSwitch systems. Focused on stability and reliability for large-scale AI infrastructures.
Posted 5/23/2026 • full-time • Remote • Arizona, California, Colorado, Illinois • 🇺🇸 United States • Senior • 💰 $152,000 - $241,500 per year
Tech Stack
Tools & technologies
Distributed Systems, Python, Shell Scripting, Switching, TCP/IP
About the role
Key responsibilities & impact
- Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
- Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.
Requirements
What you’ll need
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
- Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
- Strong system-level debugging across software, firmware, hardware, and networking layers.
- Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
- Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
- Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
- Strong communication and collaboration skills across engineering, customer, and operations teams.
- Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale.
Benefits
Comp & perks
- Eligible for equity and benefits
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
C, C++, Python, Bash, Shell scripting, system software, firmware, networking, data center infrastructure, distributed systems
Soft Skills
communication, collaboration, problem-solving, debugging, triage, leadership, operational efficiency, root-cause analysis, reliability engineering, passion for AI infrastructure
Certifications
BS in Computer Science, MS in Computer Science, BS in Computer Engineering, MS in Computer Engineering, BS in Electrical Engineering, MS in Electrical Engineering