AI Engineer – Model Performance
Fathom - AI Meeting Assistant
Model Performance Engineer at Fathom, optimizing the inference stack and fine-tuning infrastructure for AI applications. The role focuses on improving the performance and efficiency of model deployment.
Tech Stack
Tools & technologies: Python, Ray, Swift
About the role
Key responsibilities & impact
- Own the speed, cost, and reliability of our model inference stack
- Build the fine-tuning infrastructure that makes the rest of the AI team faster
- Optimize real systems serving millions of meetings: choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable
- Benchmark FP8 quantization across GPU families, for example finding that an FP8 KV cache causes catastrophic repetition loops, or identifying static quantization as 6% faster than dynamic on certain hardware
- Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding
- Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized tune ready for serving
- Optimize GPU spend — know which GPU families are best for batch workloads vs latency-sensitive paths
- Debug production inference issues
Requirements
What you’ll need
- Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar), not just deploying them but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
- Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
- Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
- Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
- Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead
- Cost modeling for GPU infrastructure — you've had to choose between GPU types and justify the tradeoff
- Experience with multimodal models (audio/vision encoders + LLM decoders)
- Experience with Modal, Ray Serve, or similar serverless GPU platforms
- Understanding of audio processing (codecs, chunking, sample rates)
- Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster
Benefits
Comp & perks
- The opportunity to shape the foundational software services of a growing company
- A role that balances innovation and incremental improvement
- A dynamic and collaborative engineering team
- Competitive compensation and benefits
- A supportive environment that encourages innovation and personal growth
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
LLM serving frameworks, quantization, fine-tuning, Python, GPU profiling, cost modeling, multimodal models, audio processing, benchmarking, debugging
Soft Skills
problem-solving, collaboration, communication, analytical thinking, attention to detail