Galileo 🔭

Software Engineer, LLM Inference

Full-time

Location: 🇺🇸 United States • California

Salary

💰 $180,000 - $300,000 per year

Job Level

Mid-Level • Senior

Tech Stack

Apache • Distributed Systems • Keras • Microservices • Python • PyTorch • Ray • TensorFlow

About the role

  • Design and scale inference infrastructure – architect and optimize distributed systems that serve LLMs at scale, ensuring low latency, high throughput, and cost efficiency.
  • Push the limits of performance – apply techniques like dynamic batching (see the sketch after this list), concurrency optimization, precision reduction, and GPU kernel tuning to maximize throughput while maintaining quality.
  • Optimize model serving pipelines – work with TensorRT, layer fusion, kernel auto-tuning, and other advanced optimizations.
  • Build robust inference microservices – design runtime services (similar to NVIDIA Triton) to support multi-tenant, real-time inference workloads in production.
  • Experiment with cutting-edge frameworks – explore and integrate technologies like Apache Ray and distributed PyTorch/TensorFlow inference.
  • Collaborate with research & product teams to translate models into reliable, efficient, and observable services.
  • Shape best practices for running LLM workloads safely, reliably, and cost-effectively across diverse hardware.
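
As a rough illustration of the dynamic batching technique named above (a sketch only, not Galileo's implementation; batch_infer, MAX_BATCH, and MAX_WAIT are hypothetical stand-ins for a real batched forward pass and its tuning knobs):

```python
import asyncio

MAX_BATCH = 32    # assumed cap on requests per GPU batch
MAX_WAIT = 0.005  # assumed max seconds a request waits for batchmates


async def batch_infer(prompts: list[str]) -> list[str]:
    # Hypothetical stand-in for one batched model forward pass.
    await asyncio.sleep(0.01)
    return [f"completion for {p!r}" for p in prompts]


class DynamicBatcher:
    """Coalesces concurrent requests into single batched model calls."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker = asyncio.create_task(self._run())

    async def infer(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, fut))
        return await fut

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            prompt, fut = await self._queue.get()
            batch, futures = [prompt], [fut]
            deadline = loop.time() + MAX_WAIT
            # Fill the batch until it is full or the wait budget is spent.
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    prompt, fut = await asyncio.wait_for(self._queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(prompt)
                futures.append(fut)
            # One forward pass serves every request in the batch.
            for f, out in zip(futures, await batch_infer(batch)):
                f.set_result(out)


async def main() -> None:
    batcher = DynamicBatcher()
    outs = await asyncio.gather(*(batcher.infer(f"prompt {i}") for i in range(100)))
    print(f"served {len(outs)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```

The trade-off is the usual one: a longer MAX_WAIT raises GPU utilization and throughput at the cost of per-request latency.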

Requirements

  • Experience building scalable machine learning compute systems and runtime microservices that serve ML models at scale
  • Experience working on large-scale distributed systems
  • Experience with high-throughput machine learning systems and platforms; bonus for prior work on model serving systems
  • Excellent Python programming skills, especially for low-latency code paths
  • Experience with model optimization techniques such as dynamic batching and concurrent handling of inference requests
  • Experience using TensorRT to optimize models prior to deployment
  • Experience with precision reduction, layer fusion, and kernel auto-tuning to reduce kernel launches and memory operations (see the precision sketch after this list)
  • Experience with low-level GPU system optimizations
  • Experience building and scaling LLM inference servers (similar to NVIDIA Triton)
  • Bonus: experience with Apache Ray (see the Ray Serve sketch after this list)
  • Bonus: experience training and running inference on models built with PyTorch, TensorFlow, Keras, and PyTorch Lightning
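
On the precision reduction item, a minimal PyTorch sketch of running the same weights in half precision (the tiny Sequential model is a placeholder, not a production LLM):

```python
import torch

# Placeholder model standing in for an LLM block.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).eval()

x = torch.randn(8, 4096)

with torch.inference_mode():
    if torch.cuda.is_available():
        # Precision reduction: same weights, FP16 storage and matmuls,
        # roughly halving memory traffic on supported GPUs.
        y = model.half().cuda()(x.half().cuda())
    else:
        y = model(x)  # CPU fallback at full precision

print(y.shape, y.dtype)
```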
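
And for the Apache Ray bonus item, a minimal Ray Serve deployment sketch (assumes Ray 2.x; the echo handler is a placeholder for a real model):

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)
class EchoModel:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # A real deployment would run a batched model forward pass here.
        return {"completion": "echo: " + payload.get("prompt", "")}


app = EchoModel.bind()
# serve.run(app) starts the replicas and serves HTTP on port 8000 by default.
```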