Tech Stack
Cloud · Distributed Systems · Docker · Kubernetes · Linux · Spring · TCP/IP
About the role
- Build and maintain robust AI compute clusters ensuring optimal performance and reliability for demanding workloads
- Collaborate with a dynamic IT team to design, deploy, and maintain high-performance AI compute clusters supporting both AMD and NVIDIA GPU technologies
- Lead initiatives to optimize cluster performance, resource utilization, and job scheduling to maximize efficiency across diverse AI workloads
- Ensure system reliability, performance, and security for cloud services, implementing monitoring solutions and automated recovery systems
- Work closely with the AI development team to align infrastructure capabilities with the evolving needs of TensorWave's cloud platform
- Troubleshoot and resolve complex infrastructure issues across Linux systems, networking, and distributed computing environments, providing expert guidance to maintain high service levels
- Implement and maintain configuration management, deployment automation, and infrastructure-as-code practices
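As a rough illustration of the monitoring and automated-recovery responsibility above, here is a minimal sketch of a GPU fleet health check. It assumes the CSV output of `nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu --format=csv,noheader,nounits`; the temperature threshold and the sample data are placeholders, not values from this posting.

```python
import csv
import io

# Hypothetical threshold; real limits depend on the card and the
# data-center cooling envelope.
TEMP_LIMIT_C = 85

def unhealthy_gpus(smi_csv: str, temp_limit: int = TEMP_LIMIT_C) -> list[int]:
    """Return indices of GPUs running above the temperature limit,
    given nvidia-smi CSV query output (index, temp, utilization)."""
    flagged = []
    for row in csv.reader(io.StringIO(smi_csv)):
        if not row:
            continue
        index, temp, _util = (field.strip() for field in row)
        if int(temp) > temp_limit:
            flagged.append(int(index))
    return flagged

# Fabricated sample output for illustration only:
sample = "0, 62, 98\n1, 91, 99\n2, 88, 97\n3, 60, 12\n"
print(unhealthy_gpus(sample))  # → [1, 2]
```

In practice a check like this would feed a metrics pipeline (e.g. a Prometheus exporter) and trigger node draining or job requeueing rather than just printing.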
Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field
- At least 5 years of relevant experience in infrastructure engineering, with a focus on supporting high-performance computing (HPC) and AI applications
- Expert-level Linux system administration skills across multiple distributions
- Strong experience with clustered computing environments (GPU, CPU, or hybrid clusters)
- Solid understanding of networking fundamentals, including TCP/IP, routing protocols, and high-speed interconnects
- Experience with container technologies (Docker, Kubernetes), job schedulers (Slurm, PBS), and configuration management tools
- Familiarity with the NVIDIA (CUDA) and AMD (ROCm) GPU ecosystems and their infrastructure requirements
- Exceptional debugging and problem-solving abilities with a methodical approach to complex system issues
- Demonstrated ability to learn new technologies quickly and adapt to rapidly evolving infrastructure needs
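To make the Slurm job-scheduler requirement concrete, here is a hedged sketch that renders a minimal multi-node GPU batch script. The partition name, GRES string, and command are illustrative placeholders; real values are site-specific.

```python
def sbatch_script(job_name: str, partition: str, nodes: int,
                  gpus_per_node: int, command: str) -> str:
    """Render a minimal Slurm batch script for a multi-node GPU job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",        # GPUs per node via generic resources
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical cluster names, for illustration only:
print(sbatch_script("train-llm", "amd-mi300", 4, 8, "python train.py"))
```

Generating scripts this way (rather than hand-editing them) fits the infrastructure-as-code practice mentioned in the role: job templates live in version control and are rendered per workload.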