Senior Solutions Architect – AI Factory Deployment

NVIDIA

Senior Solutions Architect at NVIDIA focused on developing AI factories, overseeing AI workloads, and configuring multi-GPU clusters for performance optimization.

Posted 4/29/2026 • full-time • Remote • California, North Carolina, Texas • 🇺🇸 United States • Senior • 💰 $184,000 - $287,500 per year

Tech Stack

Tools & technologies

Distributed Systems • Linux • Node.js • Python • PyTorch • TensorFlow

About the role

Key responsibilities & impact

Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

Requirements

What you’ll need

Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.
Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training.
Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Experience with benchmarking (crafting, executing, and interpreting performance benchmarks).
Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads.
Strong communication skills and the ability to work effectively with cross-functional teams.

Benefits

Comp & perks

Eligible for equity and benefits

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Linux • AI/ML workloads • NCCL • AllReduce • AllToAll • Python • Shell • benchmarking • observability • distributed systems

Soft Skills

communication • cross-functional collaboration

Certifications

Bachelor’s degree in Computer Science • Bachelor’s degree in Mathematics • Bachelor’s degree in Engineering • Bachelor’s degree in Physics