Magic

Member of Technical Staff, Supercomputing Platform – Infrastructure

full-time

Location Type: Office

Location: San Francisco, California, United States

Salary

💰 $200,000 - $550,000 per year

About the role

  • Design and operate large-scale GPU clusters for training and inference
  • Build and maintain infrastructure using Terraform across cloud and hybrid environments
  • Deploy, operate, and optimize K8s clusters used to schedule and manage AI workloads
  • Develop modular, scalable IaC patterns for compute, networking, and storage provisioning
  • Improve deployment reproducibility, environment consistency, and operational safety
  • Optimize networking and storage systems for high-throughput AI workloads
  • Automate fault detection and recovery across distributed clusters
  • Debug complex cross-layer issues spanning hardware, drivers, networking, storage, OS, and cloud
  • Improve observability, monitoring, and reliability of core platform systems

Requirements

  • Strong systems engineering fundamentals
  • Deep, hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale deployments
  • Experience operating production GPU infrastructure or high-performance distributed systems
  • Strong understanding of networking and storage systems
  • Experience with major cloud platforms (GCP, AWS, Azure, OCI, etc.)
  • Track record of owning production-critical infrastructure end-to-end

Benefits

  • Equity is a significant part of total compensation, in addition to salary
  • 401(k) plan with 6% salary matching
  • Generous health, dental and vision insurance for you and your dependents
  • Unlimited paid time off
  • Visa sponsorship and a relocation stipend to bring you to San Francisco, where possible
  • A small, fast-paced, highly focused team