FREE ACCESS
5,000–10,000 jobs/day

See all jobs on JobTailor
Search thousands of fresh jobs every day.
Discover
- Fresh listings
- Fast filters
- No subscription required
Create a free account and start exploring right away.

HPC Solutions Engineer
Hydra HostHPC Solutions Engineer designing and managing high-performance GPU computing environments for premium clients at Hydra Host. Collaborating with teams and optimizing systems for large-scale environments.
Tech Stack
Tools & technologiesAnsiblePythonTerraform
About the role
Key responsibilities & impact- Work with customers in technical discovery to help define requirements and deliverables for their use cases and help them to effectively utilize distributed GPU computing resources.
- Identify and recommend the best tools for each customer, while building boilerplate and reference implementations that can be reused for subsequent customers.
- Manage GPU clusters and coordinate with IT to ensure efficient operation of NVIDIA GPUs, utilizing technologies such as InfiniBand for high-speed networking.
- Oversee the deployment and maintenance of machine learning environments using virtual storage solutions and distributed computing (HPC) tools such as SLURM.
- Automate infrastructure provisioning and management using tools such as Ansible and Terraform.
- Collaborate with data scientists and engineers to ensure seamless integration of ML models into production environments.
- Conduct performance tuning and optimization of systems to maximize throughput and reduce latency.
- Stay current with the latest industry trends in machine learning technologies and HPC to ensure the use of best practices in infrastructure setup and model development.
- Document and maintain operational procedures and system configurations.
Requirements
What you’ll need- 7+ years of high performance compute, distributed machine learning, GPU computing, and/or system architecture experience
- Proficient in managing NVIDIA GPU environments and a familiarity with GPU computing frameworks and libraries
- Strong experience with high-speed networking technologies, specifically InfiniBand
- Experience with HPC job schedulers, preferably SLURM
- Expertise in automating environment setup and maintenance using Ansible and Terraform
- Demonstrated ability in deploying and managing virtual storage solutions
- Strong coding skills in Python and familiarity with machine learning libraries and frameworks
- Excellent problem-solving, communication, and teamwork skills.
Benefits
Comp & perks- You will work with the most diverse hardware configurations and locations available to anyone in the industry as well as cutting edge GPU use cases
- This role is fully remote with a high accountability and high agency culture
- This role offers a competitive salary, equity and benefits
- This role offers flexible PTO
ATS Keywords
✓ Tailor your resumeApplicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
High Performance ComputingDistributed Machine LearningGPU ComputingSystem ArchitectureVirtual Storage SolutionsPerformance TuningMachine Learning LibrariesDistributed ComputingTechnical DiscoveryOperational Procedures Documentation
Soft Skills
Problem-SolvingCommunicationTeamwork