AI Infrastructure Engineer

Hydra Host

Full-time

Location Type: Remote

Location: United States

Salary

💰 $150,000 - $225,000 per year

About the role

  • Get AI platform customers production-ready on Hydra — standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
  • Own the bare metal ↔ platform layer — bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
  • Configure, benchmark, and debug NVIDIA driver stacks — firmware versions, CUDA compatibility, NCCL tuning, MIG configurations (a driver-stack validation sketch follows this list).
  • Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
  • Identify gaps before customers do — pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
  • Turn customer learnings into product — working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
  • Advise customers on chip selection and tokenomics — helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads (a worked cost-per-token example follows this list).
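
The driver-stack work above usually starts with a quick health pass over every node. A minimal sketch using the nvidia-ml-py (pynvml) bindings; the minimum driver major version is an illustrative assumption, not Hydra's actual acceptance criterion:

```python
# Driver-stack sanity check via nvidia-ml-py (pip install nvidia-ml-py).
# MIN_DRIVER_MAJOR is an illustrative assumption, not a real Hydra criterion.
import pynvml

MIN_DRIVER_MAJOR = 535  # hypothetical floor for recent CUDA 12.x workloads

def check_gpu_stack() -> None:
    pynvml.nvmlInit()
    try:
        driver = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(driver, bytes):  # older bindings return bytes
            driver = driver.decode()
        cuda = pynvml.nvmlSystemGetCudaDriverVersion()
        print(f"driver {driver}, CUDA driver API {cuda // 1000}.{(cuda % 1000) // 10}")

        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"  GPU {i}: {name}, {mem.total // 2**20} MiB")

        if int(driver.split(".")[0]) < MIN_DRIVER_MAJOR:
            raise RuntimeError(f"driver {driver} below assumed floor {MIN_DRIVER_MAJOR}")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    check_gpu_stack()
```

The same NVML calls back nvidia-smi, so the output should line up with what operators see by hand.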
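
The tokenomics advising in the last bullet reduces to simple arithmetic. A toy comparison; every rate and throughput figure below is a made-up placeholder, not Hydra pricing or a measured result:

```python
# Toy cost-per-token comparison; all numbers are made-up placeholders.
def usd_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

scenarios = [
    ("cheaper GPU", 2.50, 2_400.0),  # $/hr, sustained tokens/s
    ("pricier GPU", 4.10, 5_200.0),
]
for label, rate, tps in scenarios:
    print(f"{label}: ${usd_per_million_tokens(rate, tps):.3f} per 1M tokens")
```

In this toy case the pricier GPU still wins on cost per token (about $0.22 versus $0.29 per million), which is exactly the price/performance trade-off the bullet describes.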

Requirements

  • Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
  • NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
  • Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
  • AI networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads (a minimal all-reduce sanity check follows this list).
  • Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
  • Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
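
One common way to exercise the interconnect path end to end is a small all-reduce run. A sketch using PyTorch's distributed API with the NCCL backend; the script name and GPU count in the launch line are illustrative:

```python
# Minimal NCCL all-reduce sanity check with PyTorch distributed.
# Illustrative launch:  torchrun --nproc_per_node=8 allreduce_check.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # env:// rendezvous set up by torchrun
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes a tensor of ones; after a SUM all-reduce every
    # element should equal the world size if NCCL and the fabric are healthy.
    t = torch.ones(1 << 20, device="cuda")  # ~4 MiB of float32
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    ok = bool(torch.all(t == world).item())
    print(f"rank {rank}/{world}: all_reduce {'OK' if ok else 'MISMATCH'}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

For bandwidth numbers rather than a pass/fail answer, NVIDIA's nccl-tests suite (e.g. all_reduce_perf) is the usual next step.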

Nice to Have

  • Experience with HPC or large-scale distributed training environments.
  • AI workload experience (vLLM, PyTorch, inference frameworks).
  • Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
  • IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS).

Benefits

  • Competitive salary
  • Equity ownership
  • Healthcare — medical, dental, vision for you and your family
  • Remote-first — with hubs in Phoenix, Boulder, and Miami
  • Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, NVIDIA drivers, CUDA, NCCL, NVLink, SLURM, AI workload experience, TCP/IP, VLANs, storage configuration

Soft Skills

customer-facing communication, problem-solving, collaboration, scalability focus