
ML Ops Infrastructure Engineer
Thermo Fisher Scientific
Full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 per year
About the role
- Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes.
- Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager), MIG (Multi-Instance GPU) partitioning, and NVIDIA Container Toolkit.
- Manage bare-metal provisioning workflows (iPXE, PXE, MAAS/Foreman) for repeatable server builds.
- Monitor hardware health, capacity utilization, and thermal/power envelopes.
- Build, upgrade, and maintain production-grade Kubernetes clusters on bare-metal infrastructure.
- Design and operate cluster networking using CNI plugins for high-throughput AI workloads.
- Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints.
- Deploy and operate MLOps platforms (MLflow, Kubeflow) for experiment tracking, model versioning, and pipeline orchestration.
- Design the high-bandwidth network fabric required for GPU cluster interconnects.
- Maintain hardened OS baselines across all infrastructure nodes; automate compliance scanning.
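The quota, taint, and affinity work described above is typically expressed as Kubernetes objects. A minimal sketch, assuming a hypothetical team namespace and illustrative GPU limits (rendered as plain Python dicts rather than applied YAML, so it runs without a cluster):

```python
# Hypothetical sketch of two objects named in the role: a ResourceQuota
# capping 'nvidia.com/gpu' requests in a team namespace, and a node taint
# that keeps non-GPU workloads off GPU nodes unless they tolerate it.
# Namespace name and quantities are illustrative, not from the posting.

def gpu_resource_quota(namespace: str, gpu_limit: int) -> dict:
    """Build a ResourceQuota manifest limiting GPU requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }

def gpu_node_taint() -> dict:
    """Taint for GPU nodes; pods must carry a matching toleration to schedule."""
    return {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

quota = gpu_resource_quota("ml-team-a", 8)
print(quota["spec"]["hard"])  # {'requests.nvidia.com/gpu': '8'}
```

In practice these manifests would live in a GitOps repository and be applied per-namespace, with PriorityClasses layered on top for preemption ordering.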
Requirements
- 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
- Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
- Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
- Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
- Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery.
- Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
- Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
- Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
- Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.
- Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.
Benefits
- Competitive, globally benchmarked compensation including base salary, equity, and performance bonus.
- Fully remote with an async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.
- Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).
- Collaboration with a Lead Architect and an engineering team who understand infrastructure as a product — not just a cost center.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
NVIDIA H200, NVIDIA A100, DCGM, MIG, Kubernetes, BGP, VLAN, RDMA, Ceph, Terraform
Soft Skills
written communication