Senior Solutions Architect – AI Factory Deployment
NVIDIA
Senior Solutions Architect focused on developing AI factories at NVIDIA, overseeing AI workloads and configuring multi-GPU clusters for performance optimization.
Posted 4/29/2026 • Full-time • Remote • California, North Carolina, Texas • United States • Senior • $184,000 - $287,500 per year
Tech Stack
Tools & technologies
Distributed Systems, Linux, Node.js, Python, PyTorch, TensorFlow
About the role
Key responsibilities & impact
- Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
- Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
- Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
- Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
- Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
- Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
- Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
- Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
- Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
- Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.
Requirements
What you’ll need
- Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
- 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
- Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.
- Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training.
- Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
- Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
- Experience with benchmarking (designing, executing, and interpreting performance benchmarks).
- Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads.
- Strong communication skills and the ability to work effectively with cross-functional teams.
Benefits
Comp & perks
- Eligible for equity and benefits
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux, AI/ML workloads, NCCL, AllReduce, AllToAll, Python, Shell, benchmarking, observability, distributed systems
Soft Skills
communication, cross-functional collaboration
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Mathematics, Bachelor’s degree in Engineering, Bachelor’s degree in Physics