TensorWave

AI Infrastructure Engineer

Full-time

Location: 🇺🇸 United States

Job Level

Mid-Level, Senior

Tech Stack

Cloud, Distributed Systems, Docker, Kubernetes, Linux, Spring, TCP/IP

About the role

  • Build and maintain robust AI compute clusters, ensuring optimal performance and reliability for demanding workloads
  • Collaborate with a dynamic IT team to design, deploy, and maintain high-performance AI compute clusters supporting both AMD and NVIDIA GPU technologies
  • Lead initiatives to optimize cluster performance, resource utilization, and job scheduling to maximize efficiency across diverse AI workloads
  • Ensure system reliability, performance, and security for cloud services, implementing monitoring solutions and automated recovery systems
  • Work closely with the AI development team to align infrastructure capabilities with the evolving needs of TensorWave's cloud platform
  • Troubleshoot and resolve complex infrastructure issues across Linux systems, networking, and distributed computing environments, providing expert guidance to maintain high service levels
  • Implement and maintain configuration management, deployment automation, and infrastructure-as-code practices

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or a related field
  • At least 5 years of relevant experience in infrastructure engineering, with a focus on supporting high-performance computing (HPC) and AI applications
  • Expert-level Linux system administration skills across multiple distributions
  • Strong experience with clustered computing environments (GPU, CPU, or hybrid clusters)
  • Solid understanding of networking fundamentals, including TCP/IP, routing protocols, and high-speed interconnects
  • Experience with container technologies (Docker, Kubernetes), job schedulers (Slurm, PBS), and configuration management tools
  • Familiarity with the AMD and NVIDIA GPU ecosystems (ROCm and CUDA) and their infrastructure requirements
  • Exceptional debugging and problem-solving abilities with a methodical approach to complex system issues
  • Demonstrated ability to learn new technologies quickly and adapt to rapidly evolving infrastructure needs