Senior Inference Engineer, AIConfigurator

NVIDIA

Senior Inference Engineer optimizing large-scale LLM serving for NVIDIA AIConfigurator. Building APIs, collaborating with teams, and enhancing model performance on NVIDIA platforms.

Posted 6/13/2026 • full-time • Santa Clara • California • 🇺🇸 United States • Senior • 💰 $184,000 - $356,500 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Python, Rust, APIs, CLIs, SDK, Dynamo, Kubernetes, TensorRT-LLM, vLLM, SGLang

Soft Skills

collaboration, communication, problem-solving, documentation, automation, software quality, maintainable architecture, testing, debugging, performance analysis

Tools & Technologies

NVIDIA GPU clusters, H100, H200, B200, GB200, performance databases, profiling data, validation tools, open-source software, production systems

Industry Keywords

LLM serving, configuration generation, inference runtime, benchmarking, simulation, optimization, resource management, parallelism strategies, latency, efficiency

Tech Stack

Tools & technologies

Distributed Systems, Kubernetes, Python, Rust

About the role

Key responsibilities & impact

Build and evolve AIConfigurator's core optimization engine for LLM serving, including configuration search, SLA-aware ranking, efficiency and latency estimation, and Pareto frontier analysis.
Build production-quality Python/Rust APIs, CLIs, SDK surfaces, and web workflows that help users generate strong deployment configurations for NVIDIA GPU clusters.
Develop configuration generation systems that emit backend-specific artifacts for Dynamo, Kubernetes, TensorRT-LLM, vLLM, and SGLang deployments.
Collaborate with inference runtime, performance, benchmarking, and product groups to ensure simulated results correspond with actual deployment performance on H100, H200, B200, GB200, and upcoming NVIDIA platforms.
Improve model, hardware, and backend support by integrating performance databases, profiling data, support matrices, and validation tools.
Drive software quality through maintainable architecture, schema development, tests, documentation, and automation suitable for open-source and production users.
Convert intricate inference ideas like prefill/decode disaggregation, tensor parallelism, pipeline parallelism, expert parallelism, batching, and KV cache behavior into dependable software abstractions.

Requirements

What you’ll need

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Math, or a related field, or equivalent experience.
10+ years of relevant software engineering experience.
Strong Python/Rust engineering skills, including production APIs, CLI tools, packaging, testing, debugging, and maintainable software development.
Experience with GPU computing, distributed systems, ML infrastructure, or high-performance model serving.
Understanding of LLM inference concepts such as batching, latency, efficiency, memory constraints, parallelism strategies, and serving SLAs.
Experience with data-driven performance analysis, benchmarking, simulation, optimization, or resource management.
Ability to collaborate across research, runtime, platform, and customer-facing engineering teams.
Strong written and verbal communication skills, with the ability to explain sophisticated technical tradeoffs clearly.

Benefits

Comp & perks

equity
benefits