TensorWave

AI Infrastructure Engineer

full-time

Origin: 🇺🇸 United States

Job Level

Mid-Level, Senior

Tech Stack

Cloud, Distributed Systems, Docker, Kubernetes, Linux, Spring, TCP/IP

About the role

  • Build and maintain robust AI compute clusters, ensuring optimal performance and reliability for demanding workloads
  • Collaborate with a dynamic IT team to design, deploy, and maintain high-performance AI compute clusters supporting both AMD and NVIDIA GPU technologies
  • Lead initiatives to optimize cluster performance, resource utilization, and job scheduling to maximize efficiency across diverse AI workloads
  • Ensure system reliability, performance, and security for cloud services, implementing monitoring solutions and automated recovery systems
  • Work closely with the AI development team to align infrastructure capabilities with the evolving needs of TensorWave's cloud platform
  • Troubleshoot and resolve complex infrastructure issues across Linux systems, networking, and distributed computing environments, providing expert guidance to maintain high service levels
  • Implement and maintain configuration management, deployment automation, and infrastructure-as-code practices

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or related field
  • At least 5 years of relevant experience in infrastructure engineering, with a focus on supporting high-performance computing (HPC) and AI applications
  • Expert-level Linux system administration skills across multiple distributions
  • Strong experience with clustered computing environments (GPU, CPU, or hybrid clusters)
  • Solid understanding of networking fundamentals, including TCP/IP, routing protocols, and high-speed interconnects
  • Experience with container technologies (Docker, Kubernetes), job schedulers (Slurm, PBS), and configuration management tools
  • Familiarity with the AMD and NVIDIA GPU ecosystems (ROCm and CUDA) and their infrastructure requirements
  • Exceptional debugging and problem-solving abilities with a methodical approach to complex system issues
  • Demonstrated ability to learn new technologies quickly and adapt to rapidly evolving infrastructure needs
NVIDIA

Senior Software Engineer, AI Systems

NVIDIA
Senior · full-time · $116k–$247k / year · 🇨🇦 Canada
Posted: 31 days ago · Source: nvidia.wd5.myworkdayjobs.com
AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Node.js, Python, PyTorch
NVIDIA

Senior DGX Cloud Software Engineer, Infrastructure Automation and Distributed Systems

NVIDIA
Senior · full-time · $144k–$270k / year · California · 🇺🇸 United States
Posted: 12 days ago · Source: nvidia.wd5.myworkdayjobs.com
Cloud, Distributed Systems, Docker, Go, Kubernetes, Linux, OpenStack, Python
NVIDIA

Senior Datacenter System Software Architect, DGX Cloud

NVIDIA
Senior · full-time · $184k–$357k / year · California · 🇺🇸 United States
Posted: 33 days ago · Source: nvidia.wd5.myworkdayjobs.com
Cloud, Distributed Systems, Docker, Go, Kubernetes, Linux, Microservices, Python, PyTorch, Rust, TensorFlow, Terraform
Nokia

Technical Lead

Nokia
Senior · full-time · 🇮🇳 India
Posted: 17 days ago · Source: fa-evmr-saasfaprod1.fa.ocs.oraclecloud.com
Cloud, Distributed Systems, Docker, Go, J2EE, Java, Kubernetes, Microservices, Python, Spring, Spring Boot, +1 more
Samsara

Senior Software Engineer – Optimization Engineering

Samsara
Senior · full-time · $143k–$185k / year · 🇨🇦 Canada
Posted: 7 days ago · Source: boards.greenhouse.io
AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, IoT, Java