Infrastructure Engineer, GPU & Compute
Lightning AI
Infrastructure Engineer managing GPU and compute infrastructure at Lightning AI, ensuring performance and reliability for AI/ML and HPC workloads through image management and diagnostics.
Posted 6/24/2026 • Full-time • New York City • California, New York, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $180,000 - $200,000 per year
Tech Stack
Tools & technologies: Linux, Python
About the role
Key responsibilities & impact
- Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
- Run and maintain test clusters used for system validation, diagnostics, and bring-up
- Validate firmware, drivers, and OS images across compute and GPU-enabled systems
- Support hardware qualification efforts for next-generation platforms
- Own GPU diagnostics and validation workflows across large-scale infrastructure
- Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
- Analyze system and GPU performance using tools such as NVIDIA DCGM
- Identify failure patterns and drive improvements in system stability and validation coverage
- Build and maintain automation for provisioning, validation, and system bring-up
- Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
- Improve the reliability, repeatability, and scalability of image pipelines and validation systems
- Manage and operate Linux-based systems in production and validation environments
- Manage virtualization technology
- Support bare-metal provisioning workflows, including PXE and image-based systems
- Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
- Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
- Collaborate with platform and ML teams to ensure systems meet workload requirements
- Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure
Requirements
What you’ll need
- 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Strong Linux systems experience in production environments
- Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
- Familiarity with bare-metal provisioning and system bring-up workflows
- Proficiency in Python or similar scripting/programming languages for automation
- Ability to debug complex issues across hardware, OS, GPUs, and system software
Benefits
Comp & perks
- Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux systems, Python, GPU diagnostics, firmware validation, driver validation, OS image validation, automation, system performance analysis, virtualization technology, bare-metal provisioning
Soft Skills
problem-solving, collaboration, communication, diagnostic skills, organizational skills