Biohub

Staff HPC Engineer

The Staff HPC Engineer designs hybrid HPC-AI infrastructure at Biohub, strengthening AI research capabilities through advanced computing solutions.

Posted 5/21/2026 | full-time | San Francisco • California • 🇺🇸 United States | Lead | 💰 $214,000 - $268,000 per year

Tech Stack

Tools & technologies
Ansible, AWS, Cloud, Docker, Google Cloud Platform, Kubernetes, Linux, Puppet, Python, PyTorch, Ray, TensorFlow, Terraform

About the role

Key responsibilities & impact
  • Build and support a hybrid HPC-AI environment with large-scale on-prem compute/storage and elastic cloud GPU clusters (CoreWeave, AWS, GCP).
  • Architect and optimize environments for large-scale AI training and tuning, and low-latency scientific workloads.
  • Integrate MLOps and model deployment pipelines into HPC infrastructure, ensuring reproducibility and efficiency.
  • Implement advanced resource scheduling and orchestration (Slurm, Kubernetes, SUNK) optimized for mixed HPC and AI workflows.
  • Support researchers with job optimization, GPU utilization best practices, and performance tuning for AI and HPC applications.
  • Evaluate, deploy, and maintain AI/ML software stacks (e.g., PyTorch, TensorFlow, Hugging Face, RAPIDS) and HPC toolchains.
  • Ensure robust data ingest, analysis, and management capabilities for AI and HPC workloads, including integration with parallel file systems and object storage.
  • Work with diverse science teams to translate research requirements into hardware/software solutions, from experimental design through publication.
  • Promote best practices for AI model training, validation, and deployment in shared computing environments.
  • Foster a culture of shared learning by running internal workshops on HPC-AI tooling (e.g., VS Code remote dev, containerization, MLOps workflows).

Requirements

What you’ll need
  • Bachelor’s or advanced degree in Computer Science, AI/ML, Data Science, Systems Engineering, or related field.
  • 10+ years building and managing HPC infrastructure, with significant experience integrating AI/ML workloads.
  • Proven track record architecting environments for large-scale GPU AI training and inference in hybrid on-prem/cloud environments.
  • Deep expertise with HPC scheduling (Slurm), container orchestration (Kubernetes), and cloud GPU services.
  • Strong hands-on experience with AI frameworks (PyTorch, TensorFlow, JAX) and distributed training strategies (Horovod, DeepSpeed, Ray).
  • Knowledge of MLOps best practices, including CI/CD for ML, model registry, experiment tracking, and performance monitoring.
  • Exceptional ability to collaborate with multidisciplinary teams and communicate complex technical concepts clearly.
  • Demonstrated leadership in guiding infrastructure teams, influencing organizational strategy, and fostering adoption of new technologies.
  • Advanced Linux systems administration, HPC networking (InfiniBand, Ethernet), and storage systems administration (VAST, Lustre, Weka, and ZFS).
  • Cloud platform expertise (CoreWeave, AWS, GCP) including GPU provisioning, storage, and networking for AI workloads.
  • Proficiency in automation tools (Terraform, Ansible, Puppet), containerization (Docker, Singularity), and orchestration frameworks.
  • Strong experience debugging and troubleshooting hardware across the stack (network, GPU, compute and storage systems).
  • Strong scripting/programming skills (Python, Bash) and familiarity with version control (Git).
  • Experience integrating AI LLMs, AI coding assistants, and custom model development into HPC workflows.

Benefits

Comp & perks
  • Provides a generous employer match on employee 401(k) contributions to support planning for the future.
  • Paid time off to volunteer at an organization of your choice.
  • Funding for select family-forming benefits.
  • Relocation support for employees who need assistance moving.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
HPC infrastructure, AI frameworks, MLOps, GPU provisioning, Linux systems administration, HPC networking, storage systems administration, scripting, programming, distributed training strategies
Soft Skills
collaboration, communication, leadership, problem-solving, organizational strategy, shared learning, workshop facilitation, influencing, technical concept explanation, multidisciplinary teamwork