RunPod

Manager, HPC Storage Engineer

Full-time

Location Type: Remote

Location: United States

Salary

$150,000 - $240,000 per year

About the role

  • Own Distributed Storage Architecture: Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.
  • Build the Storage Engineering Team: Manage and grow a team of storage and systems engineers. Set clear ownership, technical direction, and operational standards across regions.
  • High-Performance Shared Filesystems: Design and operate large-scale SAN and NFS deployments, including performance-sensitive shared storage for GPU clusters.
  • Advanced Filesystems & Platforms: Lead the deployment and operation of VAST Data, along with Lustre or similar parallel filesystems used in HPC and AI environments.
  • End-to-End Performance Ownership: Drive performance optimization from NAND and NVMe media through controllers, networking, and client access patterns.
  • Next-Generation Storage Technologies: Evaluate and deploy capabilities such as NFS over RDMA, GPUDirect Storage (GDS), and low-latency data paths for accelerated workloads.
  • Reliability & Scale: Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.
  • Automation & Observability: Build automation for provisioning, expansion, upgrades, and monitoring. Ensure deep observability into throughput, latency, and error characteristics.
  • Cross-Functional Collaboration: Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.
  • Vendor & Partner Management: Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.

Requirements

  • 3+ years managing storage, systems, or infrastructure engineering teams in production environments.
  • 8+ years designing and operating large-scale storage systems, including SAN and NFS architectures at multi-petabyte scale.
  • Hands-on experience deploying, operating, or deeply integrating VAST Data in production environments is required.
  • Experience with Lustre or comparable HPC filesystems (e.g., GPFS, BeeGFS) supporting high-concurrency workloads.
  • Deep understanding of NAND, NVMe, PCIe, storage controllers, and performance characteristics across the stack.
  • Proven experience with NFS over RDMA, RDMA-capable transports, or similar technologies. Familiarity with GPUDirect Storage is strongly preferred.
  • Strong Linux internals knowledge, including filesystems, I/O scheduling, memory management, and tuning for performance workloads.
  • Experience running 24/7 storage platforms with strong incident response, change management, and post-mortem discipline.
  • Ability to clearly communicate complex technical tradeoffs and lead teams through high-stakes infrastructure decisions.
  • Successful completion of a background check.