Principal Software Engineer – Rack-Scale System Software
NVIDIA
Principal Software Engineer managing rack-scale system software for CSPs at NVIDIA. Leading technical workstreams and collaboration across multi-functional teams for fleet management.
Posted 6/27/2026
full-time
Santa Clara • California, Texas • 🇺🇸 United States
Lead
💰 $272,000 - $431,250 per year
Tech Stack
Tools & technologies
- Distributed Systems
About the role
Key responsibilities & impact
- Drive rack-scale SW/FW architecture alignment across CSP engagements, including fabric management software, link health monitoring, GPU/NVSwitch error handling, SW/FW serviceability features (e.g., hot-plug support, component isolation, firmware-driven recovery), and multi-component firmware orchestration
- Drive technical workstreams with CSP engineering teams on rack-scale system software, ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations
- Capture and synthesize CSP engineering feedback on rack-scale system software (health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior), and champion that feedback in NVIDIA's architecture decisions
- Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development
- Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices, and drive the resulting documentation, tooling, and test strategy improvements
- Collaborate with execution teams on left-shift strategy — ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability
- Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams
Requirements
What you’ll need
- 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering.
- BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
- Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability
- Experience with fabric management software, cluster management, or system-level orchestration frameworks.
- Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery)
- Understanding of error handling and recovery design patterns in distributed systems — fault isolation, retry policies, graceful degradation
- Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability
- Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus
- Customer obsession — genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience
- Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority.
- Strong communication — ability to translate complex system software architecture into actionable mentorship for customer engineering teams
Benefits
Comp & perks
- equity
- benefits