AI Research Engineer – Kernel, Inference Optimization

Tether.to

AI Research Engineer at Tether optimizing advanced AI systems for model serving and inference. Focused on delivering efficient performance across real-world applications.

Posted 5/19/2026 • Full-time • Remote • 🇧🇷 Brazil • Mid-Level, Senior

Tech Stack

Tools & technologies

Flash

About the role

Key responsibilities & impact

Drive innovation in model serving and inference architectures for advanced AI systems.
Focus on optimizing model deployment and inference strategies to deliver highly responsive, efficient, and scalable performance across real-world applications.
Work on a wide spectrum of systems, ranging from resource-efficient models designed for limited hardware environments to complex, multi-modal architectures that integrate data such as text, images, and audio.
Adopt a hands-on, research-driven approach to develop, test, and implement novel serving strategies and inference algorithms.
Engineer robust inference pipelines, establish comprehensive performance metrics, and identify and resolve bottlenecks in production environments.
Enable high-throughput, low-latency, memory-efficient, and scalable AI performance that delivers tangible value in dynamic, real-world scenarios.

Requirements

What you’ll need

A degree in Computer Science or related field.
Ideally a PhD in NLP, Machine Learning, or a related field, complemented by a solid track record in AI R&D (including publications at A* conferences).
Must have knowledge of Metal Shading Language (MSL).
Proven experience in low-level kernel optimizations and inference optimization on mobile devices is essential.
Your contributions should have led to measurable improvements in inference latency, throughput, and memory footprint for domain-specific applications, particularly on resource-constrained devices and edge platforms.
A deep understanding of modern model serving architectures and inference optimization techniques is required.
Must have strong expertise in writing GPU kernels for mobile devices (i.e., smartphones) as well as a deep understanding of model serving frameworks and engines.
Practical experience in developing and deploying end-to-end inference pipelines, from optimizing models for efficient serving to integrating these solutions on resource-constrained devices, is required.
Demonstrated ability to apply empirical research to overcome challenges in model serving, such as latency optimization, computational bottlenecks, and memory constraints.
Experience with distributed inference systems: designing and optimizing high-performance inference engines using techniques like Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism to handle massive models on GPU clusters.
Deep understanding of the math and structure behind Diffusion Models and Vision Transformers.
Understanding of pruning, quantization, FlashAttention, KV caching, speculative decoding (e.g., EAGLE), and related techniques.

Benefits

Comp & perks

Our team is a global talent powerhouse, working remotely from every corner of the world.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

model serving architectures, inference optimization, Metal Shading Language (MSL), GPU kernels, low-level kernel optimizations, inference pipelines, Tensor Parallelism, Pipeline Parallelism, Expert Parallelism, Diffusion Models

Soft Skills

research-driven approach, problem-solving, empirical research application, innovation, collaboration

Certifications

PhD in NLP, PhD in Machine Learning