
AI Infrastructure Engineer
Hydra Host
full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 - $225,000 per year
About the role
- Get AI Platform customers production-ready on Hydra — standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
- Own the bare metal ↔ platform layer — bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
- Configure, benchmark, and debug NVIDIA driver stacks — firmware versions, CUDA compatibility, NCCL tuning, MIG configurations (a rough validation sketch follows this list).
- Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
- Identify gaps before customers do — pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
- Turn customer learnings into product — working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
- Advise customers on chip selection and tokenomics — helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads (a worked cost-per-token example also follows below).
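
For a concrete flavor of the driver-stack work above, here is a minimal sketch of a pre-onboarding node check. The specific fields queried and the single-driver assertion are illustrative assumptions, not Hydra's actual onboarding procedure; it only assumes a node with the NVIDIA driver installed, so `nvidia-smi` is on the path.

```python
# Sketch of a pre-onboarding GPU node check -- assumed checks, not
# Hydra's actual procedure. Requires the NVIDIA driver (nvidia-smi).
import subprocess


def query_gpus() -> list[dict[str, str]]:
    """Read index, name, driver version, and firmware (VBIOS) per GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,name,driver_version,vbios_version",
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        index, name, driver, vbios = [f.strip() for f in line.split(",")]
        gpus.append({"index": index, "name": name,
                     "driver": driver, "vbios": vbios})
    return gpus


if __name__ == "__main__":
    gpus = query_gpus()
    drivers = {g["driver"] for g in gpus}
    # A mixed driver stack across GPUs on one node is a classic
    # onboarding failure mode -- flag it before the customer hits it.
    assert len(drivers) == 1, f"mismatched drivers on node: {drivers}"
    for g in gpus:
        print(f"GPU {g['index']}: {g['name']} driver {g['driver']} "
              f"vbios {g['vbios']}")
```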
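
And for the tokenomics bullet, a worked cost-per-token comparison. Every price and throughput below is a hypothetical placeholder, not a Hydra rate or a vendor benchmark; the point is only the arithmetic: dollars per hour divided by tokens per hour.

```python
# Cost-per-token back-of-envelope -- all prices and throughputs below
# are hypothetical placeholders, not Hydra rates or vendor benchmarks.


def dollars_per_million_tokens(hourly_price: float,
                               tokens_per_sec: float) -> float:
    """$/hr divided by tokens/hr, scaled to a million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Two hypothetical chip options for the same inference workload.
    options = {
        "gpu_a": {"hourly_price": 2.50, "tokens_per_sec": 1_000},
        "gpu_b": {"hourly_price": 4.00, "tokens_per_sec": 2_200},
    }
    for name, o in options.items():
        cost = dollars_per_million_tokens(**o)
        print(f"{name}: ${cost:.2f} per 1M tokens")
    # gpu_b wins here despite the higher hourly rate: throughput
    # dominates cost per token once utilization is high.
```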
Requirements
- Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
- NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
- Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
- AI networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
- Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
- Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
Nice to Have
- HPC or large-scale distributed training environments.
- AI workload experience (vLLM, PyTorch, inference frameworks) — see the minimal benchmark sketch after this list.
- Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
- IaC and provisioning tools (Terraform, Ansible, Cloud-init, MaaS).
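
To make the NCCL/PyTorch expectations concrete, here is a minimal all-reduce micro-benchmark of the kind used to validate GPU interconnects. It assumes a node with two or more GPUs, PyTorch built with NCCL support, and a `torchrun` launcher; the message size and iteration counts are arbitrary illustrative choices, not a standard test plan.

```python
# Minimal NCCL all-reduce micro-benchmark -- a sketch only. Launch with:
#   torchrun --nproc_per_node=2 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 256 MiB of fp32 -- large enough to expose interconnect bandwidth.
    tensor = torch.ones(64 * 1024 * 1024, device="cuda")

    # Warm up so NCCL establishes communicators before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        gb = tensor.numel() * tensor.element_size() * iters / 1e9
        print(f"all_reduce: {gb / elapsed:.1f} GB/s effective")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```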
Benefits
- Competitive salary
- Equity ownership
- Healthcare — medical, dental, vision for you and your family
- Remote-first — with hubs in Phoenix, Boulder, and Miami
- Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, NVIDIA drivers, CUDA, NCCL, NVLink, SLURM, AI workload experience, TCP/IP, VLANs, storage configuration
Soft Skills
customer-facing communication, problem-solving, collaboration, scalability focus