Fathom - AI Meeting Assistant

AI Engineer – Model Performance

Fathom is hiring a Model Performance Engineer to optimize its inference stack and build fine-tuning infrastructure for AI applications. The role focuses on the performance and efficiency of model deployment.

Posted 4/30/2026 • Full-time • San Francisco, California, 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies
Python, Ray, Swift

About the role

Key responsibilities & impact
  • Own the speed, cost, and reliability of our model inference stack
  • Build the fine-tuning infrastructure that makes the rest of the AI team faster
  • Optimize real systems serving millions of meetings: choosing between quantization trade-offs, debugging speculative decoding, or figuring out why one GPU family's tail latency explodes at high concurrency while another stays stable
  • Benchmark FP8 quantization across GPU families: for example, finding that an FP8 KV cache causes catastrophic repetition loops, or that static quantization is 6% faster than dynamic on certain hardware
  • Evaluate serving frameworks (vLLM vs SGLang) with speculative decoding
  • Build a fine-tuning pipeline that takes a JSONL dataset and produces an optimized fine-tuned model ready for serving
  • Optimize GPU spend — know which GPU families are best for batch workloads vs latency-sensitive paths
  • Debug production inference issues

Requirements

What you’ll need
  • Deep experience with LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — not just deploying them, but tuning them: attention backends, scheduling strategies, CUDA graph warmup, prefix caching
  • Hands-on quantization experience — you've gone beyond "apply FP8 and hope." You understand weight vs activation quantization, per-channel vs per-tensor scaling, and when dynamic quantization introduces more overhead than it saves
  • Production fine-tuning experience — LoRA/QLoRA SFT, familiarity with training frameworks (ms-swift, Axolotl, torchtune, or similar), understanding of data formatting, learning rate schedules, and how to diagnose training failures
  • Strong Python. You'll write serving infrastructure, benchmarking harnesses, and training pipelines — not notebooks
  • Comfort with GPU profiling and performance analysis. You should be able to look at a benchmark result and know whether the bottleneck is compute, memory bandwidth, or scheduling overhead
  • Cost modeling for GPU infrastructure — you've had to choose between GPU types and justify the tradeoff
  • Experience with multimodal models (audio/vision encoders + LLM decoders)
  • Experience with Modal, Ray Serve, or similar serverless GPU platforms
  • Understanding of audio processing (codecs, chunking, sample rates)
  • Experience building internal tooling that other engineers use — this role succeeds when the rest of the team ships faster

Benefits

Comp & perks
  • The opportunity to shape the foundational software services of a growing company
  • A role that balances innovation and incremental improvement
  • A dynamic and collaborative engineering team
  • Competitive compensation and benefits
  • A supportive environment that encourages innovation and personal growth

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
LLM serving frameworks, quantization, fine-tuning, Python, GPU profiling, cost modeling, multimodal models, audio processing, benchmarking, debugging
Soft Skills
problem-solving, collaboration, communication, analytical thinking, attention to detail