
Model Serving Engineer
Fundamental
Full-time
Location Type: Remote
Location: Israel
About the role
- Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
- Implement and optimize inference pipelines, including custom backends, dynamic batching strategies, and model ensemble configurations in Triton (see the backend sketch after this list)
- Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns (see the concurrency sketch after this list)
- Tune throughput and latency across the full serving stack: batching policies, thread pool sizing, model instance groups, and memory layout
- Work closely with the research team to understand new model architectures at a computational level: batching behavior, dynamic shapes, memory access patterns, and so on
- Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior (see the metrics sketch after this list)
- Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
- Contribute to GPU utilization improvements and resource efficiency across the serving fleet
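For context on the custom-backend work above, here is a minimal sketch of a Triton Python backend (the model.py that lives in the model repository). The tensor names INPUT0/OUTPUT0 and the pass-through compute are illustrative placeholders, not this team's actual models:

```python
# Minimal Triton Python backend sketch (model.py in the model repository).
# INPUT0/OUTPUT0 are placeholder tensor names for illustration.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # With dynamic batching enabled, Triton may hand this call many
        # requests at once; one InferenceResponse is returned per request.
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            out = pb_utils.Tensor("OUTPUT0", data.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Release any per-instance resources here.
        pass
```

The batching and instance-group tuning mentioned above lives next to this file in config.pbtxt, in the dynamic_batching and instance_group blocks.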
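On the GIL point: pure-Python compute serializes across threads because only one thread holds the interpreter lock at a time, while C-level work (NumPy kernels, blocking I/O) releases it. A small sketch of choosing the right executor; the workload functions are illustrative stand-ins:

```python
# Sketch: choose an executor based on whether the workload releases the GIL.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np


def gil_bound(n: int) -> int:
    # Pure-Python loop: holds the GIL, so threads would serialize on it.
    return sum(i * i for i in range(n))


def gil_releasing(n: int) -> float:
    # NumPy's C kernels release the GIL, so threads can truly overlap.
    a = np.random.rand(n, n)
    return float((a @ a).sum())


if __name__ == "__main__":
    # Separate processes sidestep the GIL for pure-Python CPU work.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(gil_bound, [200_000] * 4)))

    # Threads are cheaper and sufficient once the GIL is released.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(gil_releasing, [512] * 4)))
```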
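And on the observability loop: Triton serves Prometheus-format metrics over HTTP, by default on port 8002. A minimal polling sketch; the endpoint and the specific metric name are assumptions drawn from Triton's documented defaults, so verify both against the actual deployment:

```python
# Sketch: poll Triton's Prometheus metrics endpoint and extract one family.
# The port (8002) and metric name below are assumptions based on Triton's
# documented defaults; confirm both against the real deployment.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"


def scrape(prefix: str) -> dict[str, float]:
    # Prometheus text format: one "<name>{labels} <value>" sample per line.
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    samples = {}
    for line in body.splitlines():
        if line.startswith(prefix):
            name, _, value = line.rpartition(" ")
            samples[name] = float(value)
    return samples


if __name__ == "__main__":
    # Cumulative time requests spent waiting in the batch queue, per model.
    for name, value in scrape("nv_inference_queue_duration_us").items():
        print(name, value)
```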
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
- Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management
- Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
- Strong understanding of neural network inference mechanics: forward passes, batching strategies, memory management, and numerical precision tradeoffs
- Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
- Experience profiling and optimizing inference code for latency and throughput at production scale (see the profiling sketch after this list)
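As a concrete instance of the profiling expectation above, a minimal latency harness; infer is a hypothetical stand-in for whatever client call is under test:

```python
# Sketch: per-request latency percentiles for an inference callable.
# `infer` is a hypothetical stand-in for the client call under test.
import statistics
import time


def profile(infer, payload, warmup: int = 10, iters: int = 100) -> None:
    for _ in range(warmup):
        infer(payload)  # warm caches, connection pools, lazy initialization
    latencies_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer(payload)
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    latencies_ms.sort()
    p50 = statistics.median(latencies_ms)
    p99 = latencies_ms[int(0.99 * (len(latencies_ms) - 1))]
    rps = 1e3 * iters / sum(latencies_ms)  # serial throughput, requests/s
    print(f"p50={p50:.2f} ms  p99={p99:.2f} ms  throughput={rps:.1f} req/s")
```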
Benefits
- Competitive compensation with salary and equity
- Comprehensive health coverage, including medical, dental, and vision, plus a 401(k)
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
- A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, Triton Inference Server, multi-threading, concurrency patterns, model serving, inference pipelines, batching strategies, memory management, profiling, optimizing inference code
Certifications
Bachelor's degree in Computer Science, Master's degree in Computer Science, Engineering degree