Staff Software Engineer, Machine Learning Inference Platform
Stack AV
Staff Engineer defining architecture for multi-tenant ML inference workloads at Stack AV. Balance coding with technical direction across ML Platform, infrastructure, and APIs.
Tech Stack
Tools & technologies
Distributed Systems, Go, gRPC, Python, PyTorch, Rust
About the role
Key responsibilities & impact
- Design platform architecture for multi-tenant inference workloads across serving, orchestration, control plane, APIs, SDKs, observability, and model-engine integration.
- Develop robust API layers (gRPC, WebSockets, REST, etc.) and developer SDKs that abstract complex distributed inference orchestration into seamless, reliable token streams.
- Build and harden a multi-tenant control plane to enable accurate metering, rate limiting, quotas, tenant isolation and noisy-neighbor fairness across the platform.
- Optimize inference performance across the entire system stack, including the model engine layer.
- Build observability and SLOs to gain insights into system economics, cache-hit rates, GPU utilization and cost accounting per model and per tenant.
- Partner with product and infrastructure teams on model onboarding, capacity planning, external API contracts and customer adoption.
- Promote engineering excellence: maintain a high bar in your own work and foster a culture of engineering excellence within the team.
Requirements
What you’ll need
- Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Experience: 7+ years of experience building and operating backend distributed systems end to end.
- Demonstrated cross-team technical leadership in backend distributed systems, ML infrastructure, inference serving, or high-performance compute platforms.
- Strong Data & ML systems fundamentals: data-intensive distributed systems, concurrency, networking and performance profiling.
- Hands-on experience running large-scale inference services on GPUs, including KV caches, prefill/decode stages and throughput/latency trade-offs.
- Direct experience with inference engines (TensorRT, vLLM, etc.) or serving frameworks (Dynamo, Triton, or equivalent).
- Technical Skills:
  - Strong programming skills in C++, Go, Rust or Python.
  - Familiarity with deep learning frameworks (PyTorch, etc.) as well as model parallelism.
  - Familiarity with GPU computing primitives such as CUDA, NCCL, NVLink, and hardware-specific optimizations.
  - Practical understanding of high-performance networking architectures, including InfiniBand, RoCE, and low-latency cluster communication.
- Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
- Autonomous vehicles (AV) experience is a bonus.
Benefits
Comp & perks
- We are proud to be an equal opportunity workplace. We believe that diverse teams produce the best ideas and outcomes. We are committed to building a culture of inclusion, entrepreneurship, and innovation across gender, race, age, sexual orientation, religion, disability, and identity.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
C++, Go, Rust, Python, gRPC, WebSockets, REST, TensorRT, vLLM, CUDA
Soft Skills
technical leadership, communication, engineering excellence