Galileo 🔭

Software Engineer, LLM Inference

full-time

Location: California • 🇺🇸 United States

Salary

💰 $180,000 - $300,000 per year

Job Level

Mid-Level · Senior

Tech Stack

Apache · Distributed Systems · Keras · Microservices · Python · PyTorch · Ray · TensorFlow

About the role

  • Design and scale inference infrastructure – architect and optimize distributed systems that serve LLMs at scale, ensuring low latency, high throughput, and cost efficiency.
  • Push the limits of performance – apply techniques like dynamic batching, concurrency optimization, precision reduction, and GPU kernel tuning to maximize throughput while maintaining quality.
  • Optimize model serving pipelines – work with TensorRT, layer fusion, kernel auto-tuning, and other advanced optimizations.
  • Build robust inference microservices – design runtime services (similar to NVIDIA Triton) to support multi-tenant, real-time inference workloads in production.
  • Experiment with cutting-edge frameworks – explore and integrate technologies like Apache Ray and distributed PyTorch/TensorFlow inference.
  • Collaborate with research & product teams to translate models into reliable, efficient, and observable services.
  • Shape best practices for running LLM workloads safely, reliably, and cost-effectively across diverse hardware.

Requirements

  • Experience building scalable machine learning compute systems and runtime microservices serving ML models at scale
  • Experience working on large-scale distributed systems
  • Experience with high-throughput machine learning systems and platforms; bonus points for work on model-serving systems
  • Excellent Python programming skills, with a focus on low-latency code
  • Experience with model optimization techniques such as dynamic batching and concurrent handling of inference requests
  • Experience using TensorRT to optimize models prior to deployment
  • Experience with precision reduction, layer fusion, and kernel auto-tuning to reduce kernel launches and memory operations
  • Experience with low-level GPU system optimizations
  • Built and scaled LLM inference servers (similar to NVIDIA Triton)
  • Bonus: experience with Apache Ray
  • Bonus: experience training and running inference on models built with PyTorch, TensorFlow, Keras, and PyTorch Lightning