
AI Infrastructure Engineer
Hydra Host
full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 - $225,000 per year
About the role
- Get AI Platform customers production-ready on Hydra — standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
- Own the bare metal ↔ platform layer — bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
- Configure, benchmark, and debug NVIDIA driver stacks — firmware versions, CUDA compatibility, NCCL tuning, MIG configurations (a rough validation sketch follows this list).
- Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
- Identify gaps before customers do — pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
- Turn customer learnings into product — working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
- Advise customers on chip selection and tokenomics — helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads (a worked cost-per-token example also follows below).
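
For a concrete flavor of the driver-stack work above, here is a minimal sketch of a pre-onboarding node check. The specific fields queried and the single-driver assertion are illustrative assumptions, not Hydra's actual onboarding procedure; it only assumes a node with the NVIDIA driver installed, so `nvidia-smi` is on the path.

```python
# Sketch of a pre-onboarding GPU node check -- assumed checks, not
# Hydra's actual procedure. Requires the NVIDIA driver (nvidia-smi).
import subprocess


def query_gpus() -> list[dict[str, str]]:
    """Read index, name, driver version, and firmware (VBIOS) per GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,driver_version,vbios_version",
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        index, name, driver, vbios = [f.strip() for f in line.split(",")]
        gpus.append({"index": index, "name": name,
                     "driver": driver, "vbios": vbios})
    return gpus


if __name__ == "__main__":
    gpus = query_gpus()
    drivers = {g["driver"] for g in gpus}
    # A mixed driver stack across GPUs on one node is a classic
    # onboarding failure mode -- flag it before the customer hits it.
    assert len(drivers) == 1, f"mismatched drivers on node: {drivers}"
    for g in gpus:
        print(f"GPU {g['index']}: {g['name']} driver {g['driver']} "
              f"vbios {g['vbios']}")
```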
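
And for the tokenomics bullet, a worked cost-per-token comparison. Every price and throughput below is a hypothetical placeholder, not a Hydra rate or a vendor benchmark; the point is only the arithmetic: dollars per hour divided by tokens per hour.

```python
# Cost-per-token back-of-envelope -- all prices and throughputs below
# are hypothetical placeholders, not Hydra rates or vendor benchmarks.


def dollars_per_million_tokens(hourly_price: float,
                               tokens_per_sec: float) -> float:
    """$/hr divided by tokens/hr, scaled to a million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Two hypothetical chip options for the same inference workload.
    options = {
        "gpu_a": {"hourly_price": 2.50, "tokens_per_sec": 1_000},
        "gpu_b": {"hourly_price": 4.00, "tokens_per_sec": 2_200},
    }
    for name, o in options.items():
        cost = dollars_per_million_tokens(**o)
        print(f"{name}: ${cost:.2f} per 1M tokens")
    # gpu_b wins here despite the higher hourly rate: throughput
    # dominates cost per token once utilization is high.
```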
Requirements
- Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
- NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
- Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
- AI networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
- Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
- Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
Nice to Have
- HPC or large-scale distributed training environments.
- AI workload experience (vLLM, PyTorch, inference frameworks) — see the minimal benchmark sketch after this list.
- Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
- IaC and provisioning tools (Terraform, Ansible, Cloud-init, MaaS).
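
To make the NCCL/PyTorch expectations concrete, here is a minimal all-reduce micro-benchmark of the kind used to validate GPU interconnects. It assumes a node with two or more GPUs, PyTorch built with NCCL support, and a `torchrun` launcher; the message size and iteration counts are arbitrary illustrative choices, not a standard test plan.

```python
# Minimal NCCL all-reduce micro-benchmark -- a sketch only. Launch with:
#   torchrun --nproc_per_node=2 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 256 MiB of fp32 -- large enough to expose interconnect bandwidth.
    tensor = torch.ones(64 * 1024 * 1024, device="cuda")

    # Warm up so NCCL establishes communicators before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        gb = tensor.numel() * tensor.element_size() * iters / 1e9
        print(f"all_reduce: {gb / elapsed:.1f} GB/s effective")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```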
Benefits
- Competitive salary
- Equity ownership
- Healthcare — medical, dental, vision for you and your family
- Remote-first — with hubs in Phoenix, Boulder, and Miami
- Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, NVIDIA drivers, CUDA, NCCL, NVLink, SLURM, AI workload experience, TCP/IP, VLANs, storage configuration
Soft Skills
customer-facing communication, problem-solving, collaboration, scalability focus