Montauk Capital

Head of Inference

Head of Inference at Stealth Edge AI Co, defining the inference architecture and building proof-of-concept systems. The role collaborates with leadership on pioneering AI solutions.

Posted 5/6/2026 • Full-time • New York City, New York, 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
Cloud • Distributed Systems • Kubernetes • Node.js • Ray • Rust

About the role

Key responsibilities & impact
  • Create the inference strategy and define the inference architecture for Edge AI
  • Own the inference serving layer end-to-end: vLLM, TensorRT-LLM, Triton, or equivalent
  • Build a credible POC fast, one that proves to NVIDIA, cloud providers, and customers that the platform works
  • Drive cost-per-token optimization
  • Optimize GPU utilization, KV-cache management, and batching for production workloads
  • Own observability and reliability SLAs
  • Build distributed inference pipelines across multi-GPU, multi-node edge deployments
  • Set performance baselines and SLAs for inference latency and throughput
  • Define quantization strategy
  • Translate complex inference requirements into infrastructure designs
  • Define the software access layer architecture and oversee integration efforts
  • Engage credibly with investors, partners, and technical stakeholders, and represent the company externally

Requirements

What you’ll need
  • Production inference serving (vLLM, TensorRT-LLM, Triton Inference Server, or equivalent), distributed at scale
  • Quantization, SGLang, containerization, cost-per-token optimization
  • Observability tooling: distributed tracing, latency profiling, alerting. Instrument and debug complex distributed systems with a focus on building world-class observability and debuggability tools
  • C++/CUDA/Rust
  • GPU utilization and CUDA kernel optimization — has pushed hardware to its limits
  • Expertise in batching, KV-cache management, and speculative decoding
  • Scale systems using Kubernetes, Ray, custom load balancing, multi-GPU/multi-node inference
  • Has built a serving system that NVIDIA and cloud providers respect
  • Model deployment and serving
  • Systems engineering
  • Technical leadership experience, either over teams or outcomes
  • Startup / 0→1 DNA: You ship fast and communicate clearly

Benefits

Comp & perks
  • Competitive compensation + equity: True ownership over what you build

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
vLLM • TensorRT-LLM • Triton Inference Server • C++ • CUDA • Rust • quantization • batching • GPU optimization • distributed systems
Soft Skills
technical leadership • communication • stakeholder engagement • problem-solving • observability focus • debugging • cost optimization • performance optimization • collaboration • adaptability