AI Engineer – Model Performance

Fathom - AI Meeting Assistant

Fathom is hiring a Model Performance Engineer to optimize the inference stack and fine-tuning infrastructure behind its AI applications, with a focus on the performance and efficiency of model deployment.

Posted 4/30/2026 • Full-time • San Francisco • California • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies

Python, Ray, Swift

About the role

Key responsibilities & impact

Own the speed, cost, and reliability of our model inference stack
Build the fine-tuning infrastructure that makes the rest of the AI team faster
Optimize real systems serving millions of meetings: choose between quantization trade-offs, debug speculative decoding, or figure out why one GPU family's tail latency explodes at high concurrency while another stays stable
Benchmark FP8 quantization across GPU families, for example finding that an FP8 KV cache causes catastrophic repetition loops, or that static quantization is 6% faster than dynamic on certain hardware
Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding
Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized tune ready for serving
Optimize GPU spend — know which GPU families are best for batch workloads vs latency-sensitive paths
Debug production inference issues
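
The GPU-spend item above largely comes down to comparing cost per generated token across hardware. The sketch below shows that arithmetic under stated assumptions: the `GpuOption` class, the GPU names, the hourly prices, and the throughput figures are all hypothetical placeholders, not Fathom's numbers or any vendor's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    """Hypothetical GPU profile; all numbers used below are placeholders."""
    name: str
    hourly_usd: float         # assumed price per GPU-hour
    tokens_per_second: float  # assumed sustained output throughput


def usd_per_million_tokens(gpu: GpuOption) -> float:
    """Cost to generate one million tokens at the sustained throughput."""
    tokens_per_hour = gpu.tokens_per_second * 3600
    return gpu.hourly_usd / tokens_per_hour * 1_000_000


def cheapest(options: list[GpuOption]) -> GpuOption:
    """Pick the option with the lowest cost per generated token."""
    return min(options, key=usd_per_million_tokens)


# Illustrative comparison: the pricier GPU wins on cost per token
# because its assumed throughput is more than twice as high.
gpu_a = GpuOption("gpu-a", hourly_usd=2.0, tokens_per_second=1000.0)
gpu_b = GpuOption("gpu-b", hourly_usd=4.0, tokens_per_second=2500.0)
best = cheapest([gpu_a, gpu_b])
```

For latency-sensitive paths the same comparison would likely need a throughput figure measured at the concurrency that still meets the tail-latency target, which can reverse the ranking that holds for batch workloads.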

Requirements

What you’ll need

Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead
Cost modeling for GPU infrastructure — you've had to choose between GPU types and justify the tradeoff
Experience with multimodal models (audio/vision encoders + LLM decoders)
Experience with Modal, Ray Serve, or similar serverless GPU platforms
Understanding of audio processing (codecs, chunking, sample rates)
Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster
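
As an illustration of the per-channel vs per-tensor scaling distinction in the quantization requirement above, here is a minimal pure-Python sketch. It uses symmetric integer quantization (an 8-bit range, `qmax=127`) as a stand-in, which is an assumption made for simplicity: real FP8 formats and production kernels behave differently. The toy weight matrix and all function names are hypothetical.

```python
def _scale(max_magnitude, qmax):
    """Scale that maps the largest magnitude onto qmax (1.0 if all zeros)."""
    return max_magnitude / qmax if max_magnitude > 0 else 1.0


def max_abs(values):
    return max(abs(v) for v in values)


def quantize_dequantize(values, scale, qmax=127):
    """Symmetric round-to-nearest quantization, then dequantization."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(weights, qmax=127):
    """One scale shared by every row of the weight matrix."""
    scale = _scale(max(max_abs(row) for row in weights), qmax)
    return [quantize_dequantize(row, scale, qmax) for row in weights]


def per_channel(weights, qmax=127):
    """One scale per output channel (here, per row)."""
    return [
        quantize_dequantize(row, _scale(max_abs(row), qmax), qmax)
        for row in weights
    ]


def mean_abs_error(original, restored):
    diffs = [
        abs(a - b)
        for row_o, row_r in zip(original, restored)
        for a, b in zip(row_o, row_r)
    ]
    return sum(diffs) / len(diffs)


# Toy weights: one small-magnitude channel next to a large-magnitude one.
weights = [
    [0.010, -0.020, 0.015, 0.005],
    [8.000, -6.500, 7.250, -3.000],
]
err_tensor = mean_abs_error(weights, per_tensor(weights))
err_channel = mean_abs_error(weights, per_channel(weights))
```

With a single shared scale, the small-magnitude row rounds entirely to zero, so `err_tensor` comes out larger than `err_channel`. The cost of per-channel scaling is one extra scale value per row to store and apply.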

Benefits

Comp & perks

The opportunity to shape the foundational software services of a growing company
A role that balances innovation and incremental improvement
A dynamic and collaborative engineering team
Competitive compensation and benefits
A supportive environment that encourages innovation and personal growth

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

LLM serving frameworks, quantization, fine-tuning, Python, GPU profiling, cost modeling, multimodal models, audio processing, benchmarking, debugging

Soft Skills

problem-solving, collaboration, communication, analytical thinking, attention to detail