
Senior AI Infrastructure Engineer
WEX
full-time
Location Type: Hybrid
Location: California • Illinois • United States
Salary
💰 $121,500 - $145,500 per year
About the role
- Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving
- Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token
- Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs
- Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning
- Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure
- Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure
- Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively
Requirements
- 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering
- Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems
- Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads
- Experience deploying and scaling open-source LLMs and embedding models using containerized solutions
- Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash
Benefits
- health, dental, and vision insurance
- retirement savings plan
- paid time off
- health savings account
- flexible spending accounts
- life insurance
- disability insurance
- tuition reimbursement
- bonuses
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes • AI platform design • low-latency serving solutions • vLLM • TGI • Triton • GPU cluster management • Infrastructure as Code • Terraform • Ansible
Soft Skills
incident response • automation • problem-solving • communication