NVIDIA

Senior Solutions Architect – AI Factory Deployment

Senior Solutions Architect role at NVIDIA focused on developing AI factories. The work covers overseeing AI workloads and configuring multi-GPU clusters for performance optimization.

Posted 4/29/2026 • Full-time • Remote • California, North Carolina, Texas • 🇺🇸 United States • Senior • 💰 $184,000 - $287,500 per year

Tech Stack

Tools & technologies
Distributed Systems, Linux, Node.js, Python, PyTorch, TensorFlow

About the role

Key responsibilities & impact
  • Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
  • Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
  • Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
  • Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
  • Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
  • Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
  • Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
  • Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
  • Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
  • Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

Requirements

What you’ll need
  • Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
  • 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
  • Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.
  • Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training.
  • Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
  • Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
  • Experience with benchmarking (crafting, executing, and interpreting performance benchmarks).
  • Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads.
  • Strong communication skills and the ability to work effectively with cross-functional teams.

Benefits

Comp & perks
  • Eligible for equity and benefits

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux, AI/ML workloads, NCCL, AllReduce, AllToAll, Python, Shell, benchmarking, observability, distributed systems
Soft Skills
communication, cross-functional collaboration
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Mathematics, Bachelor’s degree in Engineering, Bachelor’s degree in Physics