NVIDIA

Senior Software Engineer, Cloud-Native Stack – CSP Engagements

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $184,000 - $356,500 per year

Job Level

Senior

Tech Stack

Ansible, Cloud, Distributed Systems, Go, Kubernetes, Prometheus, Python, Rust, Terraform

About the role

  • Define customer workflows, prototype stack enhancements, and debug Kubernetes + Slurm issues in multi-rack, multi-tenant AI datacenters.
  • Perform deep-dive debugging of multi-rack, multi-tenant clusters: scheduler behavior, container runtime issues, device-plugin crashes, RDMA/IB fabric anomalies, etc.
  • Gather customer requirements and prototype feature extensions for Kubernetes operators, Slurm plugins, and custom micro-services that expose new GPU capabilities.
  • Drive joint architecture reviews and “whiteboard” sessions with CSP and internal platform teams; convert findings into RFCs and upstream pull requests.
  • Create reproducible testbeds (Helm/Ansible/Terraform) that mirror customer environments; automate validation and benchmark suites.
  • Deliver technical collateral—design docs, how-to guides, demo scripts—and present at customer on-sites, KubeCon, and SlurmUG.
  • Collaborate with AE, FAE, and Solution Architect teams to deliver integrated customer solutions and technical documentation.
  • Tackle complex scheduling challenges across racks, tenants, and clouds as part of the CSP engagements team.

Requirements

  • Strong source-level expertise in Kubernetes internals (scheduler, CRI/CNI/CSI, operators) and Slurm (federation, power-save, plugins).
  • Hands-on experience integrating next-gen GPUs (Blackwell/GB200/GB300) or comparable accelerators into containerized clusters.
  • Proven track record debugging large-scale, cloud-native stacks across networking (RDMA/RoCE), storage, and control planes.
  • Customer-facing engineering or solutions-architect background: requirements gathering, PoC ownership, roadmap influence.
  • Familiarity with CI/CD (GitHub Actions, Tekton), observability (Prometheus, OpenTelemetry), and infrastructure-as-code.
  • Excellent communication: able to switch between deep technical detail and high-level business impact.
  • 6+ years of professional software development experience in distributed systems (Go, Rust, C/C++ or Python for tooling).
  • BS or MS (or equivalent experience) in Computer Engineering, Computer Science, or related field.
  • Ways to stand out: upstream contributions to Kubernetes, Slurm, Volcano, or similar projects.
  • Ways to stand out: experience with GPU computing (CUDA) and deep learning workloads.