AI Inference Engineer – Model Optimization, Deployment
Zoox
Model Optimization & Deployment Engineer optimizing large-scale ML models for Zoox's autonomous vehicle technology. Focused on deployment for efficient real-time execution in vehicles.
Posted 4/11/2026 • full-time • Foster City • California, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $242,000 - $290,000 per year
Tech Stack
Tools & technologies
- Python
- PyTorch
About the role
Key responsibilities & impact
- Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference workflows, and parameter-efficient fine-tuning (LoRA, QLoRA).
- Architect and implement model conversion and compilation pipelines using TensorRT and TensorRT-LLM for edge deployment.
- Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch frameworks and compiled edge binaries.
- Write and optimize custom CUDA kernels and TensorRT Plugins to maximize memory bandwidth and minimize latency on AI accelerators.
- Write production-level, highly concurrent, and memory-safe C++ and Python code for real-time inference on vehicle SoCs.
Requirements
What you’ll need
- Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference workflows (INT8, FP8, INT4, BF16/FP16).
- Proven experience optimizing large-scale models (LLMs, VLMs, or VLAs) utilizing KV-cache optimization (e.g., PagedAttention), Speculative Decoding, and Efficient Attention mechanisms (FlashAttention, Linear Attention).
- Extensive experience with model conversion/compilation pipelines (TensorRT, TensorRT-LLM) and performing rigorous parity/latency benchmarking.
- Proficiency in low-level programming for AI accelerators, specifically writing and optimizing custom CUDA kernels and TensorRT Plugins.
- Production-level C++ (14/17/20) and Python programming skills, with experience writing concurrent, memory-safe, real-time inference code for edge devices.
Benefits
Comp & perks
- Paid time off (e.g. sick leave, vacation, bereavement)
- Unpaid time off
- Zoox Stock Appreciation Rights
- Amazon RSUs
- Health insurance
- Long-term care insurance
- Long-term and short-term disability insurance
- Life insurance
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- model quantization
- mixed-precision inference
- CUDA programming
- C++ programming
- Python programming
- real-time inference
- concurrent programming
- memory-safe programming
- model optimization
- latency benchmarking