Salary
💰 $180,000 - $300,000 per year
Tech Stack
Apache, Distributed Systems, Keras, Microservices, Python, PyTorch, Ray, TensorFlow
About the role
- Design and scale inference infrastructure – architect and optimize distributed systems that serve LLMs at scale, ensuring low latency, high throughput, and cost efficiency.
- Push the limits of performance – apply techniques like dynamic batching, concurrency optimization, precision reduction, and GPU kernel tuning to maximize throughput while maintaining quality (see the dynamic batching sketch after this list).
- Optimize model serving pipelines – use TensorRT to apply layer fusion, kernel auto-tuning, and other advanced optimizations (see the TensorRT sketch after this list).
- Build robust inference microservices – design runtime services (similar to NVIDIA Triton) to support multi-tenant, real-time inference workloads in production.
- Experiment with cutting-edge frameworks – explore and integrate technologies like Apache Ray and distributed PyTorch/TensorFlow inference.
- Collaborate with research & product teams to translate models into reliable, efficient, and observable services.
- Shape best practices for running LLM workloads safely, reliably, and cost-effectively across diverse hardware.
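For context on the kind of work described above, here is a minimal sketch of dynamic batching in plain asyncio Python: concurrent requests are queued and grouped into micro-batches so one model call serves many callers. The `DynamicBatcher` class, its parameters, and the stand-in `model_fn` are hypothetical illustrations, not part of this role's actual stack.

```python
import asyncio
from typing import Any, Callable, List, Optional, Tuple


class DynamicBatcher:
    """Collects concurrent requests into micro-batches before each model call."""

    def __init__(
        self,
        model_fn: Callable[[List[Any]], List[Any]],
        max_batch_size: int = 16,
        max_wait_ms: float = 5.0,
    ) -> None:
        self._model_fn = model_fn            # one call handles a whole batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_ms / 1000.0
        self._queue: "asyncio.Queue[Tuple[Any, asyncio.Future]]" = asyncio.Queue()
        self._worker: Optional[asyncio.Task] = None

    async def infer(self, item: Any) -> Any:
        # Lazily start the background worker inside the running event loop.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self._queue.get()
            batch, futures = [item], [fut]
            deadline = loop.time() + self._max_wait_s
            # Fill the batch until it is full or the wait budget expires.
            while len(batch) < self._max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    nxt_item, nxt_fut = await asyncio.wait_for(self._queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(nxt_item)
                futures.append(nxt_fut)
            # One forward pass serves every queued request in the batch.
            for f, out in zip(futures, self._model_fn(batch)):
                f.set_result(out)


async def main() -> None:
    # Stand-in "model": doubles each input; a real service would run the LLM here.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=8)
    results = await asyncio.gather(*(batcher.infer(i) for i in range(20)))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off tuned here is latency versus throughput: a larger batch or longer wait budget raises GPU utilization at the cost of per-request latency.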
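Similarly, a rough sketch of the TensorRT-style optimization mentioned above, assuming the TensorRT 8.x Python API and a hypothetical `model.onnx` input: the engine is built with FP16 precision reduction enabled, while layer fusion and kernel auto-tuning happen automatically inside the build step. Exact APIs vary by TensorRT version.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_fp16_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX model and build a TensorRT engine with FP16 enabled."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Precision reduction: allow FP16 kernels where accuracy permits.
    config.set_flag(trt.BuilderFlag.FP16)
    # Workspace memory available to the kernel auto-tuner (1 GiB here).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # Layer fusion and kernel auto-tuning run inside this build call.
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)


if __name__ == "__main__":
    build_fp16_engine("model.onnx", "model_fp16.plan")
```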
Requirements
- Experience building scalable machine learning compute systems and runtime microservices that serve ML models in production
- Experience working on large-scale distributed systems
- Experience with high-throughput machine learning systems and platforms; bonus points for work on model serving systems
- Excellent Python programming skills, with a focus on low-latency code
- Experience with model optimization techniques such as dynamic batching and concurrent handling of inference requests
- Experience using TensorRT to optimize models prior to deployment
- Experience with precision reduction, layer fusion, and kernel auto-tuning to reduce kernel launches and memory operations
- Experience with low-level GPU system optimizations
- Experience building and scaling LLM inference servers (similar to NVIDIA Triton)
- Bonus: experience with Apache Ray (see the Ray Serve sketch after this list)
- Bonus: experience training and running inference on models built with PyTorch, TensorFlow, Keras, and PyTorch Lightning
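As a rough illustration of the Ray and model-serving experience listed above, here is a minimal Ray Serve sketch with request batching. The `EchoLLM` wrapper is a placeholder for a real model runtime, and the replica count and batch settings are illustrative assumptions, not a production configuration.

```python
from typing import List

from ray import serve
from starlette.requests import Request


class EchoLLM:
    """Placeholder model wrapper; a real deployment would load an actual LLM here."""

    def generate(self, prompts: List[str]) -> List[str]:
        return [f"echo: {p}" for p in prompts]


@serve.deployment(num_replicas=2)  # scale out replicas; add ray_actor_options for GPUs
class LLMServer:
    def __init__(self) -> None:
        self.model = EchoLLM()  # loaded once per replica

    # Ray Serve groups concurrent calls into batches of up to 8 requests,
    # waiting at most 10 ms to fill a batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def generate_batch(self, prompts: List[str]) -> List[str]:
        return self.model.generate(prompts)

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        return await self.generate_batch(prompt)


if __name__ == "__main__":
    serve.run(LLMServer.bind())  # exposes an HTTP endpoint on the local Ray cluster
```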