
Senior Software Engineer, AI Frameworks
NVIDIA
full-time
Location Type: Remote
Location: California • United States
Salary
💰 $152,000 - $241,500 per year
About the role
- Design and implement end-to-end integrations of Grove with open-source AI frameworks (e.g., Dynamo, llm-d, Ray, PyTorch, and related ecosystem projects)
- Build and maintain adapters, plugins, operators, and/or runtime components that enable Grove features to work smoothly across training and inference stacks
- Partner with framework owners to upstream changes, contribute patches, and ensure long-term maintainability of integrations
- Develop reference workflows, sample apps, and best-practice guides that accelerate adoption by users and partners
- Optimize performance, scalability, and reliability for distributed training/inference, including multi-node and multi-GPU environments
- Improve observability and operational readiness (metrics, logging, tracing, debugging tools) for Kubernetes-based deployments
- Participate in technical design reviews, define APIs/contracts, and ensure compatibility across versions of frameworks and dependencies
- Diagnose complex issues spanning containers, networking, scheduling, CUDA/GPU utilization, and framework runtime behavior
Requirements
- BS/MS/PhD in Computer Science, Electrical Engineering, or a related field (or equivalent experience)
- 5+ years of proven experience in a related field
- Hands-on experience integrating with at least one major AI framework/runtime (e.g., PyTorch, Ray, Triton Inference Server ecosystem, distributed runtimes, model serving stacks)
- Solid understanding of AI workloads: model development basics, training vs. inference tradeoffs, and performance considerations (throughput/latency, batching, memory)
- Experience with distributed systems concepts (RPC, scheduling, fault tolerance, resource management)
- Practical Kubernetes experience: deploying and operating services/jobs, Helm/Kustomize, operators/controllers (nice to have), and debugging clusters
- Familiarity with containers and cloud-native tooling (Docker, container registries, CI/CD pipelines)
- Strong software engineering experience in Go, C++, and/or Python, with a track record of shipping reliable systems
- Strong interpersonal skills and ability to collaborate across teams and with open-source communities
- Exceptional collaboration, communication, and documentation habits
Benefits
- equity
- benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AI frameworks, PyTorch, Ray, Go, C++, Python, Kubernetes, Docker, distributed systems, model serving
Soft Skills
interpersonal skills, collaboration, communication, documentation
Certifications
BS in Computer Science, MS in Computer Science, PhD in Computer Science, BS in Electrical Engineering, MS in Electrical Engineering, PhD in Electrical Engineering