Salary
💰 $180,000 - $300,000 per year
Tech Stack
Apache, Distributed Systems, Keras, Microservices, Python, PyTorch, Ray, TensorFlow
About the role
- Design and scale inference infrastructure – architect and optimize distributed systems that serve LLMs at scale, ensuring low latency, high throughput, and cost efficiency.
- Push the limits of performance – apply techniques like dynamic batching, concurrency optimization, precision reduction, and GPU kernel tuning to maximize throughput while maintaining quality (see the dynamic batching sketch after this list).
- Optimize model serving pipelines – use TensorRT to apply layer fusion, kernel auto-tuning, and other advanced optimizations (see the TensorRT sketch after this list).
- Build robust inference microservices – design runtime services (similar to NVIDIA Triton) to support multi-tenant, real-time inference workloads in production.
- Experiment with cutting-edge frameworks – explore and integrate technologies like Apache Ray and distributed PyTorch/TensorFlow inference.
- Collaborate with research & product teams to translate models into reliable, efficient, and observable services.
- Shape best practices for running LLM workloads safely, reliably, and cost-effectively across diverse hardware.
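For context on the kind of work described above, here is a minimal sketch of dynamic batching in plain asyncio Python: concurrent requests are queued and grouped into micro-batches so one model call serves many callers. The `DynamicBatcher` class, its parameters, and the stand-in `model_fn` are hypothetical illustrations, not part of this role's actual stack.

```python
import asyncio
from typing import Any, Callable, List, Optional, Tuple


class DynamicBatcher:
    """Collects concurrent requests into micro-batches before each model call."""

    def __init__(
        self,
        model_fn: Callable[[List[Any]], List[Any]],
        max_batch_size: int = 16,
        max_wait_ms: float = 5.0,
    ) -> None:
        self._model_fn = model_fn            # one call handles a whole batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_ms / 1000.0
        self._queue: "asyncio.Queue[Tuple[Any, asyncio.Future]]" = asyncio.Queue()
        self._worker: Optional[asyncio.Task] = None

    async def infer(self, item: Any) -> Any:
        # Lazily start the background worker inside the running event loop.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self._queue.get()
            batch, futures = [item], [fut]
            deadline = loop.time() + self._max_wait_s
            # Fill the batch until it is full or the wait budget expires.
            while len(batch) < self._max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    nxt_item, nxt_fut = await asyncio.wait_for(self._queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(nxt_item)
                futures.append(nxt_fut)
            # One forward pass serves every queued request in the batch.
            for f, out in zip(futures, self._model_fn(batch)):
                f.set_result(out)


async def main() -> None:
    # Stand-in "model": doubles each input; a real service would run the LLM here.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=8)
    results = await asyncio.gather(*(batcher.infer(i) for i in range(20)))
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off tuned here is latency versus throughput: a larger batch or longer wait budget raises GPU utilization at the cost of per-request latency.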
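Similarly, a rough sketch of the TensorRT-style optimization mentioned above, assuming the TensorRT 8.x Python API and a hypothetical `model.onnx` input: the engine is built with FP16 precision reduction enabled, while layer fusion and kernel auto-tuning happen automatically inside the build step. Exact APIs vary by TensorRT version.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_fp16_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX model and build a TensorRT engine with FP16 enabled."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    # Precision reduction: allow FP16 kernels where accuracy permits.
    config.set_flag(trt.BuilderFlag.FP16)
    # Workspace memory available to the kernel auto-tuner (1 GiB here).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # Layer fusion and kernel auto-tuning run inside this build call.
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)


if __name__ == "__main__":
    build_fp16_engine("model.onnx", "model_fp16.plan")
```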
Requirements
- Experience building scalable machine learning compute systems and runtime microservices that serve ML models in production
- Experience working on large-scale distributed systems
- Experience with high-throughput machine learning systems and platforms; bonus points for work on model serving systems
- Excellent Python programming skills, with a focus on low-latency code
- Experience with model optimization techniques such as dynamic batching and concurrent handling of inference requests
- Experience using TensorRT to optimize models prior to deployment
- Experience with precision reduction, layer fusion, and kernel auto-tuning to reduce kernel launches and memory operations
- Experience with low-level GPU system optimizations
- Experience building and scaling LLM inference servers (similar to NVIDIA Triton)
- Bonus: experience with Apache Ray (see the Ray Serve sketch after this list)
- Bonus: experience training and running inference on models built with PyTorch, TensorFlow, Keras, and PyTorch Lightning
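As a rough illustration of the Ray and model-serving experience listed above, here is a minimal Ray Serve sketch with request batching. The `EchoLLM` wrapper is a placeholder for a real model runtime, and the replica count and batch settings are illustrative assumptions, not a production configuration.

```python
from typing import List

from ray import serve
from starlette.requests import Request


class EchoLLM:
    """Placeholder model wrapper; a real deployment would load an actual LLM here."""

    def generate(self, prompts: List[str]) -> List[str]:
        return [f"echo: {p}" for p in prompts]


@serve.deployment(num_replicas=2)  # scale out replicas; add ray_actor_options for GPUs
class LLMServer:
    def __init__(self) -> None:
        self.model = EchoLLM()  # loaded once per replica

    # Ray Serve groups concurrent calls into batches of up to 8 requests,
    # waiting at most 10 ms to fill a batch.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def generate_batch(self, prompts: List[str]) -> List[str]:
        return self.model.generate(prompts)

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        return await self.generate_batch(prompt)


if __name__ == "__main__":
    serve.run(LLMServer.bind())  # exposes an HTTP endpoint on the local Ray cluster
```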