Lead Inference Platform Support Engineer – AI
Thomson Reuters. Optimize LLMs and ML models for high-performance inference using techniques such as quantization, pruning, distillation, and hardware-specific tuning.
Tech Stack
Tools & technologies
AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, TensorFlow
About the role
Key responsibilities & impact
- Optimize LLMs and ML models for high-performance inference using techniques such as quantization, pruning, distillation, and hardware-specific tuning
- Deploy and scale inference workloads on GPUs across AWS, Azure, GCP, and internal Kubernetes clusters, ensuring predictable performance during peak traffic, especially business hours
- Implement routing and failover strategies for OpenAI/Anthropic/Vertex AI traffic
- Integrate models into production grade APIs supporting TR products and enterprise workflows
- Develop a highly optimized environment and eliminate performance bottlenecks to reduce latency
- Collaborate with Platform Engineering teams (Landing Zones, Network, Storage, Compute, AI) to ensure inference workloads align with TR’s cloud native patterns (AWS, Azure, GCP, OCI)
- Build and optimize containerized inference pipelines using Kubernetes for large‑scale distributed workloads
- Ensure compliance with TR’s AI standards for deployment, monitoring, governance, and drift detection
- Profile inference performance, identify GPU/CPU bottlenecks, and optimize compute utilization across heterogeneous hardware
- Implement observability and health monitoring for inference pipelines, ensuring reliability of enterprise AI services
- Collaborate with platform teams to enhance capacity forecasting for AI workloads
- Work with Product, Data Science, Architecture, and Enterprise AI teams to onboard new research models into production
- Collaborate closely with AI engineers to invent new quantization techniques, improve numerical precision, and explore non‑standard architectures
- Partner with Cloud Engineers (Azure, AWS, GCP) to develop guardrails and automation that support inference workloads
- Support the scale-out of AI infrastructure during critical releases and global product rollouts
Requirements
What you’ll need
- Strong understanding of ML/LLM fundamentals and inference optimization techniques
- Hands-on experience with GPU programming (CUDA preferred), inference runtimes (TensorRT, ONNX Runtime), and deep learning frameworks (PyTorch/TensorFlow)
- Proficiency in Python and at least one systems language (C++ strongly preferred for performance critical inference paths)
- Experience deploying AI workloads to AWS/GCP/Azure and Kubernetes
- Familiarity with vector search systems (OpenSearch vectors) and retrieval augmented generation pipelines
- Knowledge of distributed systems, microservices, CI/CD, and cloud native architecture
- Experience with neural network architectures such as CNNs, transformers, and diffusion models, and their performance characteristics
- Understanding of GPUs, multithreading, and/or other accelerators with vectorized instructions
- Specialized experience in one or more of the following machine learning/deep learning domains: model compression, hardware-aware model optimization, hardware accelerator architecture, GPU/ASIC architecture, machine learning compilers, high-performance computing, performance optimization, numerics, and SW/HW co-design
Benefits
Comp & perks
- Flexible vacation
- Two company-wide Mental Health Days off
- Access to the Headspace app
- Retirement savings
- Tuition reimbursement
- Employee incentive programs
- Resources for mental, physical, and financial wellbeing
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
LLM optimization, ML model optimization, quantization, pruning, distillation, GPU programming, CUDA, inference runtimes, TensorRT, ONNX Runtime
Soft Skills
collaboration, problem-solving, communication, capacity forecasting, performance optimization