Infrastructure Engineer, GPU & Compute

Lightning AI

Infrastructure Engineer managing GPU and compute infrastructure at Lightning AI, ensuring performance and reliability for AI/ML and HPC workloads through image management and diagnostics.

Posted 6/24/2026 • Full-time • New York City • California, New York, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $180,000 - $200,000 per year

Tech Stack

Tools & technologies

Linux, Python

About the role

Key responsibilities & impact

Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
Run and maintain test clusters used for system validation, diagnostics, and bring-up
Validate firmware, drivers, and OS images across compute and GPU-enabled systems
Support hardware qualification efforts for next-generation platforms
Own GPU diagnostics and validation workflows across large-scale infrastructure
Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
Analyze system and GPU performance using tools such as NVIDIA DCGM
Identify failure patterns and drive improvements in system stability and validation coverage
Build and maintain automation for provisioning, validation, and system bring-up
Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
Improve the reliability, repeatability, and scalability of image pipelines and validation systems
Manage and operate Linux-based systems in production and validation environments
Manage virtualization technology
Support bare-metal provisioning workflows, including PXE and image-based systems
Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
Collaborate with platform and ML teams to ensure systems meet workload requirements
Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure

Requirements

What you’ll need

5+ years of experience in infrastructure engineering, systems engineering, or related roles
Strong Linux systems experience in production environments
Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
Familiarity with bare-metal provisioning and system bring-up workflows
Proficiency in Python or similar scripting/programming languages for automation
Ability to debug complex issues across hardware, OS, GPUs, and system software

Benefits

Comp & perks

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
Generous paid time off, plus holidays
Paid parental leave
Professional development support
Wellness and work-from-home stipends
Flexible work environment

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Linux systems, Python, GPU diagnostics, firmware validation, driver validation, OS image validation, automation, system performance analysis, virtualization technology, bare-metal provisioning

Soft Skills

problem-solving, collaboration, communication, diagnostic skills, organizational skills