
ML Ops Infrastructure Engineer
Thermo Fisher Scientific
Full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 per year
About the role
- Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes.
- Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager), MIG (Multi-Instance GPU) partitioning, and NVIDIA Container Toolkit.
- Manage bare-metal provisioning workflows (iPXE, PXE, MAAS/Foreman) for repeatable server builds.
- Monitor hardware health, capacity utilization, and thermal/power envelopes.
- Build, upgrade, and maintain production-grade Kubernetes clusters on bare-metal infrastructure.
- Design and operate cluster networking using CNI plugins for high-throughput AI workloads.
- Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints.
- Deploy and operate MLOps platforms (MLflow, Kubeflow) for experiment tracking, model versioning, and pipeline orchestration.
- Design the high-bandwidth network fabric required for GPU cluster interconnects.
- Maintain hardened OS baselines across all infrastructure nodes; automate compliance scanning.
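The quota, taint, and affinity work described above is typically expressed as Kubernetes objects. A minimal sketch, assuming a hypothetical team namespace and illustrative GPU limits (rendered as plain Python dicts rather than applied YAML, so it runs without a cluster):

```python
# Hypothetical sketch of two objects named in the role: a ResourceQuota
# capping 'nvidia.com/gpu' requests in a team namespace, and a node taint
# that keeps non-GPU workloads off GPU nodes unless they tolerate it.
# Namespace name and quantities are illustrative, not from the posting.

def gpu_resource_quota(namespace: str, gpu_limit: int) -> dict:
    """Build a ResourceQuota manifest limiting GPU requests in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }

def gpu_node_taint() -> dict:
    """Taint for GPU nodes; pods must carry a matching toleration to schedule."""
    return {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

quota = gpu_resource_quota("ml-team-a", 8)
print(quota["spec"]["hard"])  # {'requests.nvidia.com/gpu': '8'}
```

In practice these manifests would live in a GitOps repository and be applied per-namespace, with PriorityClasses layered on top for preemption ordering.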
Requirements
- 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
- Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
- Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
- Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
- Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery.
- Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
- Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
- Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
- Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.
- Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.
Benefits
- Competitive, globally benchmarked compensation including base salary, equity, and performance bonus.
- Fully remote with an async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.
- Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).
- Collaboration with a Lead Architect and an engineering team who understand infrastructure as a product — not just a cost center.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
NVIDIA H200, NVIDIA A100, DCGM, MIG, Kubernetes, BGP, VLAN, RDMA, Ceph, Terraform
Soft Skills
written communication