
Model Serving Engineer
Fundamental
Full-time
Location Type: Remote
Location: Israel
About the role
- Design, build, and maintain production model serving infrastructure using Triton Inference Server as the primary framework
- Implement and optimize inference pipelines, including custom backends, dynamic batching strategies, and model ensemble configurations in Triton (see the backend sketch after this list)
- Optimize Python inference code for performance, with a strong focus on GIL contention, multi-threading, and concurrency patterns (see the concurrency sketch after this list)
- Tune throughput and latency across the full serving stack: batching policies, thread pool sizing, model instance groups, and memory layout
- Work closely with the research team to understand new model architectures at a computational level: batching behavior, dynamic shapes, memory access patterns, and so on
- Own the full resource observability and control loop for production inference: instrument GPU memory, CPU, batch queue depth, and latency metrics, and actively tune model instance groups, concurrency limits, memory budgets, and batching configuration in response to observed behavior (see the metrics sketch after this list)
- Evaluate and integrate alternative inference frameworks and runtimes as the model ecosystem evolves
- Contribute to GPU utilization improvements and resource efficiency across the serving fleet
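For context on the custom-backend work above, here is a minimal sketch of a Triton Python backend (the model.py that lives in the model repository). The tensor names INPUT0/OUTPUT0 and the pass-through compute are illustrative placeholders, not this team's actual models:

```python
# Minimal Triton Python backend sketch (model.py in the model repository).
# INPUT0/OUTPUT0 are placeholder tensor names for illustration.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the JSON-serialized config.pbtxt.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # With dynamic batching enabled, Triton may hand this call many
        # requests at once; one InferenceResponse is returned per request.
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            out = pb_utils.Tensor("OUTPUT0", data.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Release any per-instance resources here.
        pass
```

The batching and instance-group tuning mentioned above lives next to this file in config.pbtxt, in the dynamic_batching and instance_group blocks.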
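On the GIL point: pure-Python compute serializes across threads because only one thread holds the interpreter lock at a time, while C-level work (NumPy kernels, blocking I/O) releases it. A small sketch of choosing the right executor; the workload functions are illustrative stand-ins:

```python
# Sketch: choose an executor based on whether the workload releases the GIL.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np


def gil_bound(n: int) -> int:
    # Pure-Python loop: holds the GIL, so threads would serialize on it.
    return sum(i * i for i in range(n))


def gil_releasing(n: int) -> float:
    # NumPy's C kernels release the GIL, so threads can truly overlap.
    a = np.random.rand(n, n)
    return float((a @ a).sum())


if __name__ == "__main__":
    # Separate processes sidestep the GIL for pure-Python CPU work.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(gil_bound, [200_000] * 4)))

    # Threads are cheaper and sufficient once the GIL is released.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(gil_releasing, [512] * 4)))
```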
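And on the observability loop: Triton serves Prometheus-format metrics over HTTP, by default on port 8002. A minimal polling sketch; the endpoint and the specific metric name are assumptions drawn from Triton's documented defaults, so verify both against the actual deployment:

```python
# Sketch: poll Triton's Prometheus metrics endpoint and extract one family.
# The port (8002) and metric name below are assumptions based on Triton's
# documented defaults; confirm both against the real deployment.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"


def scrape(prefix: str) -> dict[str, float]:
    # Prometheus text format: one "<name>{labels} <value>" sample per line.
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    samples = {}
    for line in body.splitlines():
        if line.startswith(prefix):
            name, _, value = line.rpartition(" ")
            samples[name] = float(value)
    return samples


if __name__ == "__main__":
    # Cumulative time requests spent waiting in the batch queue, per model.
    for name, value in scrape("nv_inference_queue_duration_us").items():
        print(name, value)
```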
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 5+ years of experience in model serving, ML infrastructure, or a closely related backend engineering role
- Deep, production-level experience with Triton Inference Server, including custom Python backends, batching configuration, and model repository management
- Expert-level Python skills with a thorough understanding of the GIL, multi-threading, multiprocessing, and async concurrency patterns
- Strong understanding of neural network inference mechanics: forward passes, batching strategies, memory management, and numerical precision tradeoffs
- Hands-on experience with other inference frameworks (TorchServe, TensorFlow Serving, ONNX Runtime, vLLM, etc.) and the ability to evaluate tradeoffs between them
- Experience profiling and optimizing inference code for latency and throughput at production scale (see the profiling sketch after this list)
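As a concrete instance of the profiling expectation above, a minimal latency harness; infer is a hypothetical stand-in for whatever client call is under test:

```python
# Sketch: per-request latency percentiles for an inference callable.
# `infer` is a hypothetical stand-in for the client call under test.
import statistics
import time


def profile(infer, payload, warmup: int = 10, iters: int = 100) -> None:
    for _ in range(warmup):
        infer(payload)  # warm caches, connection pools, lazy initialization
    latencies_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer(payload)
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    latencies_ms.sort()
    p50 = statistics.median(latencies_ms)
    p99 = latencies_ms[int(0.99 * (len(latencies_ms) - 1))]
    rps = 1e3 * iters / sum(latencies_ms)  # serial throughput, requests/s
    print(f"p50={p50:.2f} ms  p99={p99:.2f} ms  throughput={rps:.1f} req/s")
```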
Benefits
- Competitive compensation with salary and equity
- Comprehensive health coverage, including medical, dental, and vision, plus a 401(k)
- Paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
- A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, Triton Inference Server, multi-threading, concurrency patterns, model serving, inference pipelines, batching strategies, memory management, profiling, optimizing inference code
Certifications
Bachelor's degree in Computer Science, Master's degree in Computer Science, Engineering degree