
Member of Technical Staff, Supercomputing Platform – Infrastructure
Magic
full-time
Location Type: Office
Location: San Francisco • California • United States
Salary
💰 $200,000 - $550,000 per year
About the role
- Design and operate large-scale GPU clusters for training and inference
- Build and maintain infrastructure using Terraform across cloud and hybrid environments
- Deploy, operate, and optimize K8s clusters used to schedule and manage AI workloads
- Develop modular, scalable IaC patterns for compute, networking, and storage provisioning
- Improve deployment reproducibility, environment consistency, and operational safety
- Optimize networking and storage systems for high-throughput AI workloads
- Automate fault detection and recovery across distributed clusters
- Debug complex cross-layer issues spanning hardware, drivers, networking, storage, OS, and cloud
- Improve observability, monitoring, and reliability of core platform systems
Requirements
- Strong systems engineering fundamentals
- Deep, hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale deployments
- Experience operating production GPU infrastructure or high-performance distributed systems
- Strong understanding of networking and storage systems
- Experience with major cloud platforms (GCP, AWS, Azure, OCI, etc.)
- Track record of owning production-critical infrastructure end-to-end
Benefits
- Equity is a significant part of total compensation, in addition to salary
- 401(k) plan with 6% salary matching
- Generous health, dental and vision insurance for you and your dependents
- Unlimited paid time off
- Visa sponsorship and relocation stipend to bring you to SF, if possible
- A small, fast-paced, highly focused team
Applicant Tracking System Keywords
Hard Skills & Tools
GPU clusters, Terraform, Kubernetes, Infrastructure as Code (IaC), networking systems, storage systems, fault detection, recovery automation, debugging, high-performance distributed systems
Soft Skills
systems engineering fundamentals, problem-solving, operational safety, environment consistency, deployment reproducibility, observability, monitoring, reliability