Tech Stack
Tools & technologies
Ansible, AWS, Cloud, Docker, Google Cloud Platform, Kubernetes, Linux, Puppet, Python, PyTorch, Ray, TensorFlow, Terraform
About the role
Key responsibilities & impact
- Build and support a hybrid HPC-AI environment with large-scale on-prem compute/storage and elastic cloud GPU clusters (CoreWeave, AWS, GCP).
- Architect and optimize environments for large-scale AI training and tuning, and low-latency scientific workloads.
- Integrate MLOps and model deployment pipelines into HPC infrastructure, ensuring reproducibility and efficiency.
- Implement advanced resource scheduling and orchestration (Slurm, Kubernetes, SUNK) optimized for mixed HPC and AI workflows.
- Support researchers with job optimization, GPU utilization best practices, and performance tuning for AI and HPC applications.
- Evaluate, deploy, and maintain AI/ML software stacks (e.g., PyTorch, TensorFlow, Hugging Face, RAPIDS) and HPC toolchains.
- Ensure robust data ingest, analysis, and management capabilities for AI and HPC workloads, including integration with parallel file systems and object storage.
- Work with diverse science teams to translate research requirements into hardware/software solutions, from experimental design through publication.
- Promote best practices for AI model training, validation, and deployment in shared computing environments.
- Foster a culture of shared learning by running internal workshops on HPC-AI tooling (e.g., VS Code remote dev, containerization, MLOps workflows).
Requirements
What you’ll need
- Bachelor’s or advanced degree in Computer Science, AI/ML, Data Science, Systems Engineering, or related field.
- 10+ years building and managing HPC infrastructure, with significant experience integrating AI/ML workloads.
- Proven track record architecting environments for large-scale GPU AI training and inference in hybrid on-prem/cloud environments.
- Deep expertise with HPC scheduling (Slurm), container orchestration (Kubernetes), and cloud GPU services.
- Strong hands-on experience with AI frameworks (PyTorch, TensorFlow, JAX) and distributed training strategies (Horovod, DeepSpeed, Ray).
- Knowledge of MLOps best practices, including CI/CD for ML, model registry, experiment tracking, and performance monitoring.
- Exceptional ability to collaborate with multidisciplinary teams and communicate complex technical concepts clearly.
- Demonstrated leadership in guiding infrastructure teams, influencing organizational strategy, and fostering adoption of new technologies.
- Advanced Linux systems administration, HPC networking (InfiniBand, Ethernet), and storage systems administration (VAST, Lustre, Weka, and ZFS).
- Cloud platform expertise (CoreWeave, AWS, GCP) including GPU provisioning, storage, and networking for AI workloads.
- Proficiency in automation tools (Terraform, Ansible, Puppet), containerization (Docker, Singularity), and orchestration frameworks.
- Strong experience debugging and troubleshooting hardware across the stack (network, GPU, compute and storage systems).
- Strong scripting/programming skills (Python, Bash) and familiarity with version control (Git).
- Experience integrating LLMs, AI coding assistants, and custom model development into HPC workflows.
Benefits
Comp & perks
- Generous employer match on employee 401(k) contributions to support planning for the future.
- Paid time off to volunteer at an organization of your choice.
- Funding for select family-forming benefits.
- Relocation support for employees who need assistance moving.
