Lead & Develop: Build, coach, and mentor a team of Super Intelligence HPC Support Engineers, ensuring technical excellence and strong execution in customer-facing work.
Escalation Ownership: Take point on high-visibility incidents and escalations with hyperscale customers, ensuring timely, transparent, and high-quality outcomes.
Customer Advocacy: Represent the needs of Super Intelligence customers in cross-functional discussions, influencing product design and roadmap decisions to improve supportability.
Incident Leadership: Guide your team through major incidents, driving consistency in communication, coordination, and resolution under pressure.
Operational Excellence: Define and refine support processes, runbooks, and documentation tailored to hyperscale environments.
Partnership: Collaborate closely with Product, Engineering, and Data Center teams to ensure Lambda delivers reliable, scalable solutions at the largest levels of deployment.
Metrics & Accountability: Monitor team performance and drive improvements in SLA adherence, response/resolution quality, and customer satisfaction.
Hands-On Leadership: Step in to troubleshoot complex issues and model the standard of excellence expected from your team.
Requirements
Proven track record leading technical support or engineering teams serving enterprise or hyperscale customers.
Skilled at managing customer escalations and major incidents with clarity, confidence, and urgency.
Deep expertise in HPC environments including GPU clusters, InfiniBand/RoCE networks, and Linux system administration.
Ability to guide engineers through troubleshooting at scale, from orchestration (Slurm/Kubernetes) down to kernel-level debugging.
Strong leadership presence: able to inspire, set direction, and build a culture of accountability and customer-first execution.
Excellent communication skills, capable of engaging with both engineers and executive stakeholders.
Advanced degree in Computer Science, Engineering, or related field (nice to have).
Certifications in HPC, networking, or related technologies (nice to have).
Experience with Slurm, Kubernetes, and high-performance interconnects such as InfiniBand, RoCE, and NVLink/NVSwitch (nice to have).
Background supporting Private Cloud environments or other dedicated enterprise clusters (nice to have).
Experience supporting enterprise AI workloads across startups and Fortune 500 companies (nice to have).
Benefits
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
HPC environments, GPU clusters, InfiniBand, RoCE networks, Linux system administration, Slurm, Kubernetes, kernel-level debugging, troubleshooting, enterprise AI workloads