NVIDIA

Senior Software Engineer, AI Frameworks

full-time

Location Type: Remote

Location: California, United States

Salary

💰 $152,000 - $241,500 per year

About the role

  • Design and implement end-to-end integrations of Grove with open-source AI frameworks (e.g., Dynamo, llm-d, Ray, PyTorch, and related ecosystem projects)
  • Build and maintain adapters, plugins, operators, and/or runtime components that enable Grove features to work smoothly across training and inference stacks
  • Partner with framework owners to upstream changes, contribute patches, and ensure long-term maintainability of integrations
  • Develop reference workflows, sample apps, and best-practice guides that accelerate adoption by users and partners
  • Optimize performance, scalability, and reliability for distributed training/inference, including multi-node and multi-GPU environments
  • Improve observability and operational readiness (metrics, logging, tracing, debugging tools) for Kubernetes-based deployments
  • Participate in technical design reviews, define APIs/contracts, and ensure compatibility across versions of frameworks and dependencies
  • Diagnose complex issues spanning containers, networking, scheduling, CUDA/GPU utilization, and framework runtime behavior

Requirements

  • BS/MS/PhD in Computer Science, Electrical Engineering, or a related field (or equivalent experience)
  • 5+ years of proven experience in a related field
  • Hands-on experience integrating with at least one major AI framework/runtime (e.g., PyTorch, Ray, Triton Inference Server ecosystem, distributed runtimes, model serving stacks)
  • Solid understanding of AI workloads: model development basics, training vs. inference tradeoffs, and performance considerations (throughput/latency, batching, memory)
  • Experience with distributed systems concepts (RPC, scheduling, fault tolerance, resource management)
  • Practical Kubernetes experience: deploying and operating services/jobs, Helm/Kustomize, operators/controllers (nice to have), and debugging clusters
  • Familiarity with containers and cloud-native tooling (Docker, container registries, CI/CD pipelines)
  • Strong software engineering experience in Go, C++ and/or Python, with a track record of shipping reliable systems
  • Strong interpersonal skills and ability to collaborate across teams and with open-source communities
  • Exceptional collaboration, communication, and documentation habits

Benefits

  • Equity
  • Benefits