Lightning AI

Infrastructure Engineer, GPU & Compute

Infrastructure Engineer role managing GPU and compute infrastructure at Lightning AI, ensuring performance and reliability for AI/ML and HPC workloads through image management and diagnostics.

Posted 6/24/2026 • Full-time • New York City • California, New York, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $180,000 - $200,000 per year

Tech Stack

Tools & technologies
Linux, Python

About the role

Key responsibilities & impact
  • Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
  • Run and maintain test clusters used for system validation, diagnostics, and bring-up
  • Validate firmware, drivers, and OS images across compute and GPU-enabled systems
  • Support hardware qualification efforts for next-generation platforms
  • Own GPU diagnostics and validation workflows across large-scale infrastructure
  • Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
  • Analyze system and GPU performance using tools such as NVIDIA DCGM
  • Identify failure patterns and drive improvements in system stability and validation coverage
  • Build and maintain automation for provisioning, validation, and system bring-up
  • Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
  • Improve the reliability, repeatability, and scalability of image pipelines and validation systems
  • Manage and operate Linux-based systems in production and validation environments
  • Manage virtualization technology
  • Support bare-metal provisioning workflows, including PXE and image-based systems
  • Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging
  • Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
  • Collaborate with platform and ML teams to ensure systems meet workload requirements
  • Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure

Requirements

What you’ll need
  • 5+ years of experience in infrastructure engineering, systems engineering, or related roles
  • Strong Linux systems experience in production environments
  • Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
  • Familiarity with bare-metal provisioning and system bring-up workflows
  • Proficiency in Python or similar scripting/programming languages for automation
  • Ability to debug complex issues across hardware, OS, GPUs, and system software

Benefits

Comp & perks
  • Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
  • Retirement and financial wellness support (U.S.); pension contribution (U.K.)
  • Generous paid time off, plus holidays
  • Paid parental leave
  • Professional development support
  • Wellness and work-from-home stipends
  • Flexible work environment

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux systems, Python, GPU diagnostics, firmware validation, driver validation, OS image validation, automation, system performance analysis, virtualization technology, bare-metal provisioning
Soft Skills
problem-solving, collaboration, communication, diagnostic skills, organizational skills