Design and operate large-scale GPU clusters for training and inference
Build and maintain infrastructure using Terraform across cloud and hybrid environments
Deploy, operate, and optimize Kubernetes clusters used to schedule and manage AI workloads
Develop modular, scalable IaC patterns for compute, networking, and storage provisioning
Improve deployment reproducibility, environment consistency, and operational safety
Optimize networking and storage systems for high-throughput AI workloads
Automate fault detection and recovery across distributed clusters
Debug complex cross-layer issues spanning hardware, drivers, networking, storage, OS, and cloud
Improve observability, monitoring, and reliability of core platform systems

Requirements

Strong systems engineering fundamentals
Deep, hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale deployments
Experience operating production GPU infrastructure or high-performance distributed systems
Strong understanding of networking and storage systems
Experience with major cloud platforms (GCP, AWS, Azure, OCI, etc.)
Track record of owning production-critical infrastructure end-to-end

Benefits

Equity is a significant part of total compensation, in addition to salary
401(k) plan with 6% salary matching
Generous health, dental and vision insurance for you and your dependents
Unlimited paid time off
Visa sponsorship and a relocation stipend to bring you to SF, where possible
A small, fast-paced, highly focused team

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

GPU clusters, Terraform, Kubernetes, Infrastructure as Code (IaC), networking systems, storage systems, fault detection, recovery automation, debugging, high-performance distributed systems

Soft Skills

systems engineering fundamentals, problem-solving, operational safety, environment consistency, deployment reproducibility, observability, monitoring, reliability