Tech Stack
Cloud · Distributed Systems · Docker · Kubernetes · Linux · Spring · TCP/IP
About the role
- Build and maintain robust AI compute clusters ensuring optimal performance and reliability for demanding workloads
- Collaborate with a dynamic IT team to design, deploy, and maintain high-performance AI compute clusters supporting both AMD and NVIDIA GPU technologies
- Lead initiatives to optimize cluster performance, resource utilization, and job scheduling to maximize efficiency across diverse AI workloads
- Ensure system reliability, performance, and security for cloud services, implementing monitoring solutions and automated recovery systems
- Work closely with the AI development team to align infrastructure capabilities with the evolving needs of TensorWave's cloud platform
- Troubleshoot and resolve complex infrastructure issues across Linux systems, networking, and distributed computing environments, providing expert guidance to maintain high service levels
- Implement and maintain configuration management, deployment automation, and infrastructure-as-code practices
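As a rough illustration of the monitoring and automated-recovery responsibility above, here is a minimal sketch of a GPU fleet health check. It assumes the CSV output of `nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu --format=csv,noheader,nounits`; the temperature threshold and the sample data are placeholders, not values from this posting.

```python
import csv
import io

# Hypothetical threshold; real limits depend on the card and the
# data-center cooling envelope.
TEMP_LIMIT_C = 85

def unhealthy_gpus(smi_csv: str, temp_limit: int = TEMP_LIMIT_C) -> list[int]:
    """Return indices of GPUs running above the temperature limit,
    given nvidia-smi CSV query output (index, temp, utilization)."""
    flagged = []
    for row in csv.reader(io.StringIO(smi_csv)):
        if not row:
            continue
        index, temp, _util = (field.strip() for field in row)
        if int(temp) > temp_limit:
            flagged.append(int(index))
    return flagged

# Fabricated sample output for illustration only:
sample = "0, 62, 98\n1, 91, 99\n2, 88, 97\n3, 60, 12\n"
print(unhealthy_gpus(sample))  # → [1, 2]
```

In practice a check like this would feed a metrics pipeline (e.g. a Prometheus exporter) and trigger node draining or job requeueing rather than just printing.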
Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field
- At least 5 years of relevant experience in infrastructure engineering, with a focus on supporting high-performance computing (HPC) and AI applications
- Expert-level Linux system administration skills across multiple distributions
- Strong experience with clustered computing environments (GPU, CPU, or hybrid clusters)
- Solid understanding of networking fundamentals, including TCP/IP, routing protocols, and high-speed interconnects
- Experience with container technologies (Docker, Kubernetes), job schedulers (Slurm, PBS), and configuration management tools
- Familiarity with the NVIDIA (CUDA) and AMD (ROCm) GPU ecosystems and their infrastructure requirements
- Exceptional debugging and problem-solving abilities with a methodical approach to complex system issues
- Demonstrated ability to learn new technologies quickly and adapt to rapidly evolving infrastructure needs
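To make the Slurm job-scheduler requirement concrete, here is a hedged sketch that renders a minimal multi-node GPU batch script. The partition name, GRES string, and command are illustrative placeholders; real values are site-specific.

```python
def sbatch_script(job_name: str, partition: str, nodes: int,
                  gpus_per_node: int, command: str) -> str:
    """Render a minimal Slurm batch script for a multi-node GPU job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",        # GPUs per node via generic resources
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical cluster names, for illustration only:
print(sbatch_script("train-llm", "amd-mi300", 4, 8, "python train.py"))
```

Generating scripts this way (rather than hand-editing them) fits the infrastructure-as-code practice mentioned in the role: job templates live in version control and are rendered per workload.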