
Machine Learning Engineer, Platform
aion
Full-time
Location Type: Hybrid
Location: Bengaluru, India
Job Level: Mid-Level / Senior
Tech Stack: AWS, Azure, Cloud, Docker, Flash, Google Cloud Platform, Python, PyTorch
About the role
- Design and implement end-to-end LLMOps pipelines for model training, fine-tuning, and evaluation
- Fine-tune and customize LLMs (Llama, Mistral, Gemma, etc.) using full fine-tuning and PEFT techniques (LoRA, QLoRA) with tools like Unsloth, Axolotl, and HuggingFace Transformers
- Implement RLHF (Reinforcement Learning from Human Feedback) pipelines for model alignment and preference optimization
- Design experiments for automated hyperparameter tuning, training strategies, and model selection
- Prepare and validate training datasets—ensuring data quality, preprocessing, and format correctness
- Build comprehensive model evaluation systems with custom metrics (BLEU, ROUGE, perplexity, accuracy) and develop synthetic data generation pipelines
- Optimize model accuracy, token efficiency, and training performance through systematic experimentation
- Design and maintain prompt engineering workflows with version control systems
- Deploy models using vLLM with multi-adapter LoRA serving, hot-swapping, and basic optimizations (speculative decoding, continuous batching, KV cache management)
- Set up ML-specific monitoring for model quality, drift detection, and performance tracking with automated retraining triggers
- Manage model versioning, artifact storage, lineage tracking, and reproducibility using experiment tracking tools
- Debug production model issues and optimize cost-performance trade-offs for training and inference
- Partner with infrastructure engineers on ML-specific compute requirements and deployment pipelines
- Document model development processes and share knowledge through internal tech talks
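The evaluation responsibilities above mention perplexity among the custom metrics. As an illustrative sketch (not part of the posting itself), perplexity is the exponential of the mean negative log-likelihood per token, computed here from a list of per-token log-probabilities; the function name and example values are hypothetical:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities the model assigned to each
    observed token. Lower perplexity means the model was less "surprised".
    """
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n  # mean negative log-likelihood
    return math.exp(nll)

# A uniform model over a 4-token vocabulary assigns log(1/4) to every
# token, so its perplexity is 4 regardless of sequence length.
logprobs = [math.log(0.25)] * 10
print(perplexity(logprobs))  # ≈ 4.0
```

A model that assigns probability 1 to every observed token reaches the floor of perplexity 1.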
Requirements
- 4-6 years of hands-on experience in machine learning engineering or applied ML roles
- Strong fine-tuning experience with modern LLMs—practical knowledge of transformer architectures, attention mechanisms, and both full fine-tuning and PEFT techniques (LoRA/QLoRA)
- Deep understanding of transformer model architectures including modern variants (MoE, Grouped-Query Attention, Flash Attention, state space models)
- Production ML experience—you've built and fine-tuned models for real-world applications
- Proficiency in Python and ML frameworks (PyTorch, HuggingFace Transformers, PEFT, TRL) with hands-on experience in tools like Unsloth and Axolotl
- Experience building model evaluation systems with metrics like BLEU, ROUGE, perplexity, and accuracy
- Hands-on experience with prompt engineering, synthetic data generation, and data preprocessing pipelines
- Basic deployment experience with vLLM including multi-adapter serving, hot-swapping, and inference optimizations
- Understanding of GPU computing—memory management, multi-GPU training, mixed precision, gradient accumulation
- Strong debugging skills for training failures, OOM errors, convergence issues, and data quality problems
- Experience with model alignment techniques (RLHF, DPO) and implementing RLHF pipelines is highly desirable
- Experience with distributed training (DeepSpeed, FSDP, DDP) is a plus
- Knowledge of model quantization techniques (GPTQ, AWQ) and their impact on model quality is desirable
- Prior experience with AWS SageMaker, MLflow for experiment tracking, and Weights & Biases is a strong plus
- Exposure to cloud platforms (AWS/GCP/Azure) for training workloads is beneficial
- Familiarity with Docker containerization for reproducible training environments
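For context on the LoRA/QLoRA requirement above: LoRA freezes the base weight matrix W and learns a low-rank update scaled by alpha/r, so the effective weight is W + (alpha/r) * B @ A. A minimal dependency-free sketch of that arithmetic (all names and values are illustrative, not from the posting or any specific library):

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Effective weight after a LoRA update: W + (alpha / r) * B @ A.

    Shapes: W is (d_out, d_in), B is (d_out, r), A is (r, d_in).
    B is initialized to zeros in standard LoRA, so training starts
    from exactly the frozen base weights.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# With B at its zero init, the effective weight equals the base W.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]            # rank r = 1
B = [[0.0], [0.0]]
print(lora_effective_weight(W, A, B, alpha=16, r=1))  # → [[1.0, 2.0], [3.0, 4.0]]
```

Only A and B (r * (d_in + d_out) parameters) are trained, which is why adapters are cheap enough to store per task and hot-swap at serving time, as the vLLM multi-adapter bullet describes.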
Benefits
- Work directly with high-pedigree founders shaping technical and product strategy.
- Build infrastructure powering the future of AI compute globally.
- Significant ownership and impact with equity reflective of your contributions.
- Competitive compensation, flexible work options, and wellness benefits.