HPC Solutions Engineer

Hydra Host

HPC Solutions Engineer designing and managing high-performance GPU computing environments for premium clients at Hydra Host. Collaborating with teams and optimizing systems for large-scale environments.

Posted 7/2/2026full-timeRemote • 🇺🇸 United StatesSeniorLeadWebsite

Tech Stack

Tools & technologies

AnsiblePythonTerraform

About the role

Key responsibilities & impact

Work with customers in technical discovery to help define requirements and deliverables for their use cases and help them to effectively utilize distributed GPU computing resources.
Identify and recommend the best tools for each customer, while building boilerplate and reference implementations that can be reused for subsequent customers.
Manage GPU clusters and coordinate with IT to ensure efficient operation of NVIDIA GPUs, utilizing technologies such as InfiniBand for high-speed networking.
Oversee the deployment and maintenance of machine learning environments using virtual storage solutions and distributed computing (HPC) tools such as SLURM.
Automate infrastructure provisioning and management using tools such as Ansible and Terraform.
Collaborate with data scientists and engineers to ensure seamless integration of ML models into production environments.
Conduct performance tuning and optimization of systems to maximize throughput and reduce latency.
Stay current with the latest industry trends in machine learning technologies and HPC to ensure the use of best practices in infrastructure setup and model development.
Document and maintain operational procedures and system configurations.

Requirements

What you’ll need

7+ years of high performance compute, distributed machine learning, GPU computing, and/or system architecture experience
Proficient in managing NVIDIA GPU environments and a familiarity with GPU computing frameworks and libraries
Strong experience with high-speed networking technologies, specifically InfiniBand
Experience with HPC job schedulers, preferably SLURM
Expertise in automating environment setup and maintenance using Ansible and Terraform
Demonstrated ability in deploying and managing virtual storage solutions
Strong coding skills in Python and familiarity with machine learning libraries and frameworks
Excellent problem-solving, communication, and teamwork skills.

Benefits

Comp & perks

You will work with the most diverse hardware configurations and locations available to anyone in the industry as well as cutting edge GPU use cases
This role is fully remote with a high accountability and high agency culture
This role offers a competitive salary, equity and benefits
This role offers flexible PTO

ATS Keywords

✓ Tailor your resume

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

High Performance ComputingDistributed Machine LearningGPU ComputingSystem ArchitectureVirtual Storage SolutionsPerformance TuningMachine Learning LibrariesDistributed ComputingTechnical DiscoveryOperational Procedures Documentation

Soft Skills

Problem-SolvingCommunicationTeamwork