Lead Inference Platform Support Engineer – AI

Thomson Reuters

Posted 5/4/2026 • Full-time • Toronto, Canada • Senior • CA$140,000 - CA$175,000 per year

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, TensorFlow

About the role

Key responsibilities & impact

Optimize LLMs and ML models for high-performance inference using techniques such as quantization, pruning, distillation, and hardware-specific tuning
Deploy and scale inference workloads on GPUs across AWS, Azure, GCP, and internal Kubernetes clusters, ensuring predictable performance during peak traffic, especially during business hours
Implement routing and failover strategies for OpenAI/Anthropic/Vertex AI traffic
Integrate models into production-grade APIs supporting TR products and enterprise workflows
Develop highly optimized environments and eliminate performance bottlenecks to reduce latency
Collaborate with Platform Engineering teams (Landing Zones, Network, Storage, Compute, AI) to ensure inference workloads align with TR’s cloud-native patterns (AWS, Azure, GCP, OCI)
Build and optimize containerized inference pipelines using Kubernetes for large‑scale distributed workloads
Ensure compliance with TR’s AI standards for deployment, monitoring, governance, and drift detection
Profile inference performance, identify GPU/CPU bottlenecks, and optimize compute utilization across heterogeneous hardware
Implement observability and health monitoring for inference pipelines, ensuring reliability of enterprise AI services
Collaborate with platform teams to enhance capacity forecasting for AI workloads
Work with Product, Data Science, Architecture, and Enterprise AI teams to onboard new research models into production
Collaborate closely with AI engineers to invent new quantization techniques, improve numerical precision, and explore non-standard architectures
Partner with Cloud Engineers (Azure, AWS, GCP) to develop guardrails and automation that support inference workloads
Support the scale-out of AI infrastructure during critical releases and global product rollouts

Requirements

What you’ll need

Strong understanding of ML/LLM fundamentals and inference optimization techniques
Hands-on experience with GPU programming (CUDA preferred), inference runtimes (TensorRT, ONNX Runtime), and deep learning frameworks (PyTorch/TensorFlow)
Proficiency in Python and at least one systems language (C++ strongly preferred for performance-critical inference paths)
Experience deploying AI workloads to AWS/GCP/Azure and Kubernetes
Familiarity with vector search systems (e.g., OpenSearch vector search) and retrieval-augmented generation pipelines
Knowledge of distributed systems, microservices, CI/CD, and cloud-native architecture
Experience with neural network architectures such as CNNs, transformers, and diffusion models, and their performance characteristics
Understanding of GPUs, multithreading, and/or other accelerators with vectorized instructions
Specialized experience in one or more of the following machine learning/deep learning domains: model compression, hardware-aware model optimization, hardware accelerator architecture, GPU/ASIC architecture, machine learning compilers, high-performance computing, performance optimization, numerics, and SW/HW co-design

Benefits

Comp & perks

Flexible vacation
Two company-wide Mental Health Days off
Access to the Headspace app
Retirement savings
Tuition reimbursement
Employee incentive programs
Resources for mental, physical, and financial wellbeing

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

LLM optimization, ML model optimization, quantization, pruning, distillation, GPU programming, CUDA, inference runtimes, TensorRT, ONNX Runtime

Soft Skills

collaboration, problem-solving, communication, capacity forecasting, performance optimization