Salary
💰 $110,000 - $120,000 per year
Tech Stack
AWS, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, SQL, TensorFlow
About the role
- Design, implement, and maintain end-to-end pipelines for LLM training, fine-tuning, validation, and deployment
- Build and optimize scalable infrastructure for large language model operations
- Deploy LLMs to production environments with prompt management, observability, serverless deployment, proper monitoring, scaling, and performance optimization
- Design, develop, and maintain RESTful API endpoints for LLM inference and model interactions
- Ensure API reliability, performance optimization, rate limiting, authentication, and comprehensive documentation
- Implement comprehensive monitoring solutions for model performance, drift detection, and system health metrics
- Research and evaluate emerging LLMOps techniques, tools, and methodologies
- Provide informed recommendations on technology choices, architecture decisions, and implementation strategies
- Establish and document best practices for LLM operations, deployment patterns, and governance frameworks
- Develop prototypes and POCs to validate new approaches and technologies
- Work closely with data scientists, ML engineers, DevOps teams, and product managers
- Create comprehensive documentation for systems, processes, and architectural decisions
- Mentor team members and share expertise through technical presentations and training sessions
- Optimize data preprocessing and feature engineering pipelines for LLM training and inference
- Implement data validation, quality checks, and lineage tracking for model training datasets
- Design efficient data storage and retrieval systems for large-scale model artifacts and training data
- Implement model governance frameworks including audit trails, compliance monitoring, and approval workflows
- Ensure secure model deployment practices, access controls, and data privacy measures
- Identify and mitigate risks associated with LLM deployment and operations
- Maintain development, staging, and production environments for LLM workflows
Requirements
- Bachelor’s degree (B.E./B.Tech./M.Tech.) in Computer Science, Statistics, Engineering, or a related field, or equivalent experience (exceptional candidates without formal degrees will be considered)
- 6-12 years of experience building production-quality software, with strong programming skills in Python and SQL (at least 5 years in Python)
- 2+ years of hands-on experience in LLMOps
- 1+ years of experience with machine learning operations, model deployment, and lifecycle management
- Proficiency with at least one major cloud provider (AWS or GCP) and their ML services
- Experience with Docker, Kubernetes, and container orchestration for ML workloads
- Strong experience in designing, building, and maintaining production-grade APIs for ML services
- Proficiency with Git, CI/CD pipelines, and DevOps practices
- Understanding of LLM architectures, training methodologies, and fine-tuning techniques
- Knowledge of ML pipeline design, model monitoring, and deployment strategies
- Understanding of distributed systems, scalability patterns, and microservices architecture
- Good-to-Have: Experience with HuggingFace Transformers, PyTorch, TensorFlow, or similar frameworks
- Good-to-Have: Knowledge of prompt optimization and RAG (Retrieval-Augmented Generation) architectures
- Good-to-Have: Experience with vector search