Machine Learning Engineer, Platform

aion

Full-time

Location Type: Hybrid

Location: Bengaluru • 🇮🇳 India

Job Level

Mid-Level • Senior

Tech Stack

AWS • Azure • Cloud • Docker • Flash • Google Cloud Platform • Python • PyTorch

About the role

  • Design and implement end-to-end LLMOps pipelines for model training, fine-tuning, and evaluation
  • Fine-tune and customize LLMs (Llama, Mistral, Gemma, etc.) using full fine-tuning and PEFT techniques (LoRA, QLoRA) with tools like Unsloth, Axolotl, and HuggingFace Transformers (see the LoRA sketch after this list)
  • Implement RLHF (Reinforcement Learning from Human Feedback) pipelines for model alignment and preference optimization (see the DPO loss sketch after this list)
  • Design experiments for automated hyperparameter tuning, training strategies, and model selection (see the tuning sketch after this list)
  • Prepare and validate training datasets—ensuring data quality, preprocessing, and format correctness
  • Build comprehensive model evaluation systems with custom metrics (BLEU, ROUGE, perplexity, accuracy) and develop synthetic data generation pipelines (see the evaluation sketch after this list)
  • Optimize model accuracy, token efficiency, and training performance through systematic experimentation
  • Design and maintain prompt engineering workflows with version control systems
  • Deploy models using vLLM with multi-adapter LoRA serving, hot-swapping, and basic optimizations (speculative decoding, continuous batching, KV cache management) (see the vLLM sketch after this list)
  • Set up ML-specific monitoring for model quality, drift detection, and performance tracking with automated retraining triggers (see the drift-check sketch after this list)
  • Manage model versioning, artifact storage, lineage tracking, and reproducibility using experiment tracking tools
  • Debug production model issues and optimize cost-performance trade-offs for training and inference
  • Partner with infrastructure engineers on ML-specific compute requirements and deployment pipelines
  • Document model development processes and share knowledge through internal tech talks
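
Below, a minimal LoRA fine-tuning sketch using HuggingFace Transformers and PEFT. The base model name and every hyperparameter are illustrative placeholders, not the team's actual configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # assumed base model, for illustration only
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all model weights.
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # adapter scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```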
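
A full RLHF pipeline needs a reward model and a policy-optimization loop, which is more than a listing can show; as a compact stand-in, here is the closely related DPO preference-optimization loss in plain PyTorch. The helper name is hypothetical, and inputs are assumed to be summed per-sequence token log-probabilities.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over per-sequence log-probs (hypothetical helper)."""
    # Implicit reward: how much the policy upweights a completion
    # relative to the frozen reference model.
    chosen_reward = policy_chosen_logps - ref_chosen_logps
    rejected_reward = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between preferred and dispreferred completions.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
```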
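
For automated hyperparameter tuning, one common pattern is an Optuna study; the search space and the objective below are toy placeholders standing in for a real training-and-evaluation run.

```python
import optuna

def objective(trial):
    # Hypothetical search space for a fine-tuning run.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    rank = trial.suggest_categorical("lora_rank", [8, 16, 32])
    # Placeholder: a real objective would launch training and return eval loss.
    return (lr * 1e4 - 1.0) ** 2 + 0.01 * rank

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```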
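
Evaluation sketch: ROUGE via the HuggingFace `evaluate` library, plus perplexity derived from mean token-level loss. The predictions and the loss value are toy placeholders.

```python
import math
import evaluate

rouge = evaluate.load("rouge")  # requires the rouge_score package
preds = ["the model summarizes the incident report"]
refs = ["the model summarized the incident report accurately"]
print(rouge.compute(predictions=preds, references=refs))

# Perplexity is the exponential of the mean per-token negative log-likelihood.
mean_nll = 2.1  # placeholder value from an eval loop
print("perplexity:", math.exp(mean_nll))
```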
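
Multi-adapter serving sketch with vLLM: the base model is loaded once and a per-request LoRARequest selects the adapter, which is what makes adapter hot-swapping cheap. Model name and adapter path are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True, max_loras=4)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Each request can name a different adapter: (name, integer id, local path).
out = llm.generate(
    ["Summarize our Q3 incident report."],
    params,
    lora_request=LoRARequest("support-adapter", 1, "/adapters/support"),
)
print(out[0].outputs[0].text)
```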
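
Drift-check sketch: one simple option (an assumption here, not a prescribed method) is a two-sample KS test comparing a recent window of some scalar quality signal against a reference window. Data, threshold, and the retraining hook are all stand-ins.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 1000)  # reference window (toy data)
live = rng.normal(0.3, 1.0, 1000)      # recent production window (toy data)

stat, p_value = ks_2samp(baseline, live)
if p_value < 0.01:                     # illustrative significance threshold
    print("drift detected: fire automated retraining trigger")
```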

Requirements

  • 4-6 years of hands-on experience in machine learning engineering or applied ML roles
  • Strong fine-tuning experience with modern LLMs—practical knowledge of transformer architectures, attention mechanisms, and both full fine-tuning and PEFT techniques (LoRA/QLoRA)
  • Deep understanding of transformer model architectures including modern variants (MoE, Grouped-Query Attention, Flash Attention, state space models)
  • Production ML experience—you've built and fine-tuned models for real-world applications
  • Proficiency in Python and ML frameworks (PyTorch, HuggingFace Transformers, PEFT, TRL) with hands-on experience in tools like Unsloth and Axolotl
  • Experience building model evaluation systems with metrics like BLEU, ROUGE, perplexity, and accuracy
  • Hands-on experience with prompt engineering, synthetic data generation, and data preprocessing pipelines
  • Basic deployment experience with vLLM including multi-adapter serving, hot-swapping, and inference optimizations
  • Understanding of GPU computing—memory management, multi-GPU training, mixed precision, gradient accumulation (see the training-loop sketch after this list)
  • Strong debugging skills for training failures, OOM errors, convergence issues, and data quality problems
  • Experience with model alignment techniques (RLHF, DPO) and implementing RLHF pipelines is highly desirable
  • Experience with distributed training (DeepSpeed, FSDP, DDP) is a plus
  • Knowledge of model quantization techniques (GPTQ, AWQ) and their impact on model quality is desirable
  • Prior experience with AWS SageMaker, MLflow for experiment tracking, and Weights & Biases is a strong plus (see the MLflow sketch after this list)
  • Exposure to cloud platforms (AWS/GCP/Azure) for training workloads is beneficial
  • Familiarity with Docker containerization for reproducible training environments
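
Training-loop sketch showing the mixed precision and gradient accumulation named above, in plain PyTorch; the model, data, and loss are stand-ins for a real fine-tuning loop.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()   # stand-in for an LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8                            # effective batch = micro-batch x 8

for step in range(32):
    x = torch.randn(4, 512, device="cuda")                 # toy micro-batch
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()                      # placeholder loss
    scaler.scale(loss / accum_steps).backward()            # accumulate grads
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)             # unscales grads, then steps
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```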
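
Experiment-tracking sketch with MLflow; the experiment, run, and parameter names are illustrative, not a prescribed setup.

```python
import mlflow

mlflow.set_experiment("llm-finetuning")
with mlflow.start_run(run_name="lora-r16-demo"):
    mlflow.log_params({"base_model": "llama-3.1-8b", "lora_r": 16})
    for epoch, loss in enumerate([1.92, 1.41, 1.18]):  # toy loss curve
        mlflow.log_metric("train_loss", loss, step=epoch)
```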

Benefits

  • Work directly with high-pedigree founders shaping technical and product strategy.
  • Build infrastructure powering the future of AI compute globally.
  • Significant ownership and impact with equity reflective of your contributions.
  • Competitive compensation, flexible work options, and wellness benefits.