WEX

Senior AI Infrastructure Engineer

full-time

Location Type: Hybrid

Location: California, Illinois, United States

Salary

💰 $121,500 - $145,500 per year

About the role

  • Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving
  • Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token
  • Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs
  • Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning
  • Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure
  • Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure
  • Ensure all AI infrastructure meets strict security standards, including encryption of sensitive data and effective access controls (IAM, VPCs)

Requirements

  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering
  • Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems
  • Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads
  • Experience deploying and scaling open-source LLMs and embedding models using containerized solutions
  • Strong belief in "Everything as Code": you automate toil wherever possible using Python, Go, or Bash

Benefits

  • health, dental, and vision insurance
  • retirement savings plan
  • paid time off
  • health savings account
  • flexible spending accounts
  • life insurance
  • disability insurance
  • tuition reimbursement
  • bonuses

Applicant Tracking System Keywords

Hard Skills & Tools

Kubernetes, AI platform design, low-latency serving solutions, vLLM, TGI, Triton, GPU cluster management, Infrastructure as Code, Terraform, Ansible

Soft Skills

incident response, automation, problem-solving, communication