Fundamental

Model Serving Engineer

full-time

Location Type: Remote

Location: Israel

About the role

  • Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
  • Implement and optimize inference pipelines including custom backends, dynamic batching strategies, and model ensemble configurations in Triton
  • Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns
  • Tune throughput and latency across the full serving stack, batching policies, thread pool sizing, model instance groups, and memory layout
  • Work closely with the research team to understand new model architectures at a computational level: batching behavior, dynamic shapes, memory access patterns, and similar characteristics
  • Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior
  • Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
  • Contribute to GPU utilization improvements and resource efficiency across the serving fleet
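The batching and instance-group tuning described above is typically expressed in Triton's per-model `config.pbtxt`. A minimal sketch of the kind of configuration this role would own (the model name and all values here are illustrative, not taken from the posting):

```protobuf
# Illustrative config.pbtxt for a hypothetical model "encoder".
name: "encoder"
backend: "python"
max_batch_size: 32

# Dynamic batching: coalesce incoming requests, waiting up to 500
# microseconds to reach a preferred batch size before dispatching.
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 500
}

# Two model instances on GPU 0 so one can execute while the other
# loads the next batch (an instance-group / throughput tradeoff).
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

Tuning `preferred_batch_size`, `max_queue_delay_microseconds`, and instance `count` against observed queue depth and latency is exactly the control loop the bullets above describe.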

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
  • Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management
  • Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
  • Strong understanding of neural network inference mechanics, forward passes, batching strategies, memory management, and numerical precision tradeoffs
  • Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
  • Experience profiling and optimizing inference code for latency and throughput at production scale
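The batching and concurrency skills listed above can be illustrated with a small request micro-batcher in pure Python: block for the first request, then gather more until the batch fills or a wait budget expires (the same policy Triton's dynamic batcher applies). All names and parameters here are illustrative:

```python
import queue
import time


def collect_batch(q, max_batch=8, max_wait_s=0.005):
    """Gather one inference batch from a request queue.

    Blocks until at least one request arrives, then keeps pulling
    requests until the batch is full or the wait budget expires.
    """
    batch = [q.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # budget spent waiting; ship a partial batch
    return batch
```

Because `queue.Queue` releases the GIL while blocking, a thread running this loop coexists well with other Python threads; the tradeoff being tuned is the same one as in Triton's `max_queue_delay_microseconds`: a longer wait yields fuller batches (throughput) at the cost of per-request latency.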

Benefits

  • Competitive compensation with salary and equity
  • Comprehensive health coverage, including medical, dental, vision, and 401(k)
  • Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
  • Relocation support for employees moving to join the team in one of our office locations
  • A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, Triton Inference Server, multi-threading, concurrency patterns, model serving, inference pipelines, batching strategies, memory management, profiling, optimizing inference code
Education
Bachelor's degree in Computer Science, Master's degree in Computer Science, Engineering degree