
Machine Learning Engineer, Platform
aion
Full-time
Location Type: Hybrid
Location: Bengaluru, India
Job Level: Mid-Level / Senior
Tech Stack: AWS, Azure, Cloud, Docker, Flash, Google Cloud Platform, Python, PyTorch
About the role
- Design and implement end-to-end LLMOps pipelines for model training, fine-tuning, and evaluation
- Fine-tune and customize LLMs (Llama, Mistral, Gemma, etc.) using full fine-tuning and PEFT techniques (LoRA, QLoRA) with tools like Unsloth, Axolotl, and HuggingFace Transformers
- Implement RLHF (Reinforcement Learning from Human Feedback) pipelines for model alignment and preference optimization
- Design experiments for automated hyperparameter tuning, training strategies, and model selection
- Prepare and validate training datasets—ensuring data quality, preprocessing, and format correctness
- Build comprehensive model evaluation systems with custom metrics (BLEU, ROUGE, perplexity, accuracy) and develop synthetic data generation pipelines
- Optimize model accuracy, token efficiency, and training performance through systematic experimentation
- Design and maintain prompt engineering workflows with version control systems
- Deploy models using vLLM with multi-adapter LoRA serving, hot-swapping, and basic optimizations (speculative decoding, continuous batching, KV cache management)
- Set up ML-specific monitoring for model quality, drift detection, and performance tracking with automated retraining triggers
- Manage model versioning, artifact storage, lineage tracking, and reproducibility using experiment tracking tools
- Debug production model issues and optimize cost-performance trade-offs for training and inference
- Partner with infrastructure engineers on ML-specific compute requirements and deployment pipelines
- Document model development processes and share knowledge through internal tech talks
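The evaluation responsibilities above mention perplexity among the custom metrics. As an illustrative sketch (not part of the posting itself), perplexity is the exponential of the mean negative log-likelihood per token, computed here from a list of per-token log-probabilities; the function name and example values are hypothetical:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: natural-log probabilities the model assigned to each
    observed token. Lower perplexity means the model was less "surprised".
    """
    n = len(token_logprobs)
    nll = -sum(token_logprobs) / n  # mean negative log-likelihood
    return math.exp(nll)

# A uniform model over a 4-token vocabulary assigns log(1/4) to every
# token, so its perplexity is 4 regardless of sequence length.
logprobs = [math.log(0.25)] * 10
print(perplexity(logprobs))  # ≈ 4.0
```

A model that assigns probability 1 to every observed token reaches the floor of perplexity 1.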
Requirements
- 4-6 years of hands-on experience in machine learning engineering or applied ML roles
- Strong fine-tuning experience with modern LLMs—practical knowledge of transformer architectures, attention mechanisms, and both full fine-tuning and PEFT techniques (LoRA/QLoRA)
- Deep understanding of transformer model architectures including modern variants (MoE, Grouped-Query Attention, Flash Attention, state space models)
- Production ML experience—you've built and fine-tuned models for real-world applications
- Proficiency in Python and ML frameworks (PyTorch, HuggingFace Transformers, PEFT, TRL) with hands-on experience in tools like Unsloth and Axolotl
- Experience building model evaluation systems with metrics like BLEU, ROUGE, perplexity, and accuracy
- Hands-on experience with prompt engineering, synthetic data generation, and data preprocessing pipelines
- Basic deployment experience with vLLM including multi-adapter serving, hot-swapping, and inference optimizations
- Understanding of GPU computing—memory management, multi-GPU training, mixed precision, gradient accumulation
- Strong debugging skills for training failures, OOM errors, convergence issues, and data quality problems
- Experience with model alignment techniques (RLHF, DPO) and implementing RLHF pipelines is highly desirable
- Experience with distributed training (DeepSpeed, FSDP, DDP) is a plus
- Knowledge of model quantization techniques (GPTQ, AWQ) and their impact on model quality is desirable
- Prior experience with AWS SageMaker, MLflow for experiment tracking, and Weights & Biases is a strong plus
- Exposure to cloud platforms (AWS/GCP/Azure) for training workloads is beneficial
- Familiarity with Docker containerization for reproducible training environments
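For context on the LoRA/QLoRA requirement above: LoRA freezes the base weight matrix W and learns a low-rank update scaled by alpha/r, so the effective weight is W + (alpha/r) * B @ A. A minimal dependency-free sketch of that arithmetic (all names and values are illustrative, not from the posting or any specific library):

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Effective weight after a LoRA update: W + (alpha / r) * B @ A.

    Shapes: W is (d_out, d_in), B is (d_out, r), A is (r, d_in).
    B is initialized to zeros in standard LoRA, so training starts
    from exactly the frozen base weights.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# With B at its zero init, the effective weight equals the base W.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]            # rank r = 1
B = [[0.0], [0.0]]
print(lora_effective_weight(W, A, B, alpha=16, r=1))  # → [[1.0, 2.0], [3.0, 4.0]]
```

Only A and B (r * (d_in + d_out) parameters) are trained, which is why adapters are cheap enough to store per task and hot-swap at serving time, as the vLLM multi-adapter bullet describes.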
Benefits
- Work directly with high-pedigree founders shaping technical and product strategy.
- Build infrastructure powering the future of AI compute globally.
- Significant ownership and impact with equity reflective of your contributions.
- Competitive compensation, flexible work options, and wellness benefits.