Senior Solutions Architect, AI Cluster Performance, Telemetry
NVIDIA
Senior Solutions Architect specializing in Data Center Systems & Performance at NVIDIA. Analyzing and optimizing performance of world-class AI, deep learning, and HPC ecosystems.
Posted 6/4/2026 • full-time • Santa Clara • California, Texas • 🇺🇸 United States • Senior • 💰 $184,000 - $356,500 per year
Tech Stack
Tools & technologies
- Ansible
- Cloud
- Docker
- Grafana
- Kubernetes
- Prometheus
About the role
Key responsibilities & impact
- Work together with our partners and customers to identify, analyze, and resolve complex performance bottlenecks across interconnected GPU, CPU, and networking systems.
- Complete and maintain robust performance benchmarking suites to stress-test high-performance clusters and establish performance baselines.
- Apply industry-standard performance tools to monitor hardware performance counters and extract deep system telemetry.
- Deeply investigate system and software configurations to find and fix subtle discrepancies that impact peak performance.
- Partner closely with internal engineering units and outside collaborators and customers to collectively develop solutions and boost infrastructure performance.
Requirements
What you’ll need
- BS or MS in Engineering, Electrical Engineering, Physics, or Computer Science (or equivalent experience).
- 8+ years of work-related experience in the high-tech industry, particularly in system build, performance analysis, and technical customer-facing roles.
- A strong understanding of how CPUs, GPUs, and high-speed networking fabrics interact within massive clusters.
- Practical experience with performance counters, profiling tools, and telemetry collection systems (e.g., Perf, eBPF, Prometheus, Grafana).
- Practical experience working with containers, cloud provisioning, and scheduling tools such as Docker, Docker Swarm, Kubernetes, SLURM, Ansible.
- Proven track record of transforming raw logs and telemetry into structured time series data, dashboards, and heat maps.
- The ability to translate complex, low-level technical performance anomalies into clear, actionable narratives for cross-functional teams.
- Strong collaborative skills and a proven history of building successful relationships across diverse engineering and operations teams.
Benefits
Comp & perks
- Equity
- Benefits
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- performance analysis
- performance benchmarking
- system telemetry
- performance counters
- profiling tools
- telemetry collection
- cloud provisioning
- scheduling tools
- transforming raw logs
- data visualization
Soft Skills
- collaborative skills
- relationship building
- communication
- problem-solving
- analytical thinking
- technical narrative translation
- cross-functional teamwork
- customer-facing skills
- investigative skills
- attention to detail