Thermo Fisher Scientific

ML Ops Infrastructure Engineer

Full-time

Location Type: Remote

Location: United States


Salary

💰 $150,000 per year

About the role

  • Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes.
  • Implement and tune NVIDIA-specific tooling: DCGM (Data Center GPU Manager), MIG (Multi-Instance GPU) partitioning, and NVIDIA Container Toolkit.
  • Manage bare-metal provisioning workflows (iPXE, PXE, MAAS/Foreman) for repeatable server builds.
  • Monitor hardware health, capacity utilization, and thermal/power envelopes.
  • Build, upgrade, and maintain production-grade Kubernetes clusters on bare-metal infrastructure.
  • Design and operate cluster networking using CNI plugins for high-throughput AI workloads.
  • Implement resource quotas, LimitRanges, PriorityClasses, and node affinity/taints.
  • Deploy and operate MLOps platforms (MLflow, Kubeflow) for experiment tracking, model versioning, and pipeline orchestration.
  • Design the high-bandwidth network fabric required for GPU cluster interconnects.
  • Maintain hardened OS baselines across all infrastructure nodes; automate compliance scanning.
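A minimal sketch of the MIG partitioning and DCGM health-check workflow described above, assuming an A100-40GB node (MIG profile IDs differ by GPU model and driver version; ID 9 corresponds to the 3g.20gb profile on an A100-40GB):

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset before taking effect)
nvidia-smi -i 0 -mig 1

# Create two 3g.20gb GPU instances and their compute instances (-C);
# profile ID 9 is 3g.20gb on an A100-40GB -- IDs differ on other GPUs
nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify the resulting GPU instances
nvidia-smi mig -lgi

# Basic DCGM health checks: discover GPUs, then run a short (level 1) diagnostic
dcgmi discovery -l
dcgmi diag -r 1
```

With the NVIDIA GPU Operator on Kubernetes, the same partitioning is typically driven declaratively through a MIG configuration rather than ad-hoc `nvidia-smi` calls, so the layout stays reproducible across node rebuilds.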

Requirements

  • 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
  • Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
  • Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
  • Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
  • Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery.
  • Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
  • Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
  • Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
  • Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.
  • Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.
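As one illustrative approach to the automated compliance scanning and audit evidence mentioned above (a sketch, assuming a RHEL 9 node with the `scap-security-guide` package installed; the SSG `cui` profile maps to NIST SP 800-171 controls, and the datastream path varies by distribution):

```shell
# Scan the node against the CUI (NIST SP 800-171) profile and write an
# HTML report usable as audit evidence; content ships with scap-security-guide
oscap xccdf eval \
  --profile xccdf_org.ssgproject.content_profile_cui \
  --report /tmp/cui-report.html \
  /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
```

Running this from Ansible or a cron job on every node, and shipping the reports to central storage, turns the hardened-baseline requirement into continuously collected evidence.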

Benefits

  • Competitive, globally benchmarked compensation including base salary, equity, and performance bonus.
  • Fully remote with async-first culture; periodic travel to client facilities and team on-sites for cluster deployments and planning.
  • Access to cutting-edge NVIDIA hardware, early access to new GPU generations, and budget for relevant certifications (NVIDIA, CKA/CKS, RHCSA, etc.).
  • Collaboration with a Lead Architect and engineering team who understand infrastructure as a product — not just a cost center.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
NVIDIA H200, NVIDIA A100, DCGM, MIG, Kubernetes, BGP, VLAN, RDMA, Ceph, Terraform
Soft Skills
written communication