Machine Learning Engineer – Inference Optimization

Featherless AI

Full-time

Posted on: 1/22/2026

Location Type: Remote

Location: Anywhere in the World

Job Level

Mid-Level, Senior

Tech Stack

Cloud, Distributed Systems, PyTorch

About the role

Optimize inference latency, throughput, and cost for large-scale ML models in production
Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, IO)
Implement and tune techniques such as:
  Quantization (fp16, bf16, int8, fp8)
  KV-cache optimization and reuse
  Speculative decoding, batching, and streaming
  Model pruning or architectural simplifications for inference
Collaborate with research engineers to productionize new model architectures
Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
Improve system reliability, observability, and cost efficiency under real workloads

Requirements

Strong experience in ML inference optimization or high-performance ML systems
Solid understanding of deep learning internals (attention, memory layout, compute graphs)
Hands-on experience with PyTorch (or similar) and model deployment
Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
Experience scaling inference for real users (not just research benchmarks)
Comfortable working in fast-moving startup environments with ownership and ambiguity
Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
Experience optimizing across different hardware vendors
Open-source contributions in ML systems or inference tooling
Background in distributed systems or low-latency services

Benefits

Competitive compensation + meaningful equity at Series A

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

ML inference optimization, high-performance ML systems, deep learning internals, PyTorch, GPU performance tuning, CUDA, Triton, TensorRT, ONNX Runtime, distributed systems

Soft Skills

collaboration, ownership, adaptability, problem-solving, communication