Salary
💰 $60 - $65 per hour
Tech Stack
AWS, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, SQL, TensorFlow
About the role
- Design, implement, and maintain end-to-end pipelines for LLM training, fine-tuning, validation, and deployment
- Build and optimize scalable infrastructure for large language model operations
- Deploy LLMs to production environments with prompt management, observability, serverless deployment, monitoring, scaling, and performance optimization
- Design, develop, and maintain RESTful API endpoints for LLM inference and model interactions
- Ensure API reliability, performance optimization, rate limiting, authentication, and comprehensive documentation
- Implement comprehensive monitoring solutions for model performance, drift detection, and system health metrics
- Research and evaluate emerging LLMOps techniques, tools, and methodologies, and provide recommendations on technology and architecture
- Establish and document best practices for LLM operations, deployment patterns, and governance frameworks
- Develop prototypes and POCs to validate new approaches and technologies
- Collaborate closely with data scientists, ML engineers, DevOps teams, and product managers
- Create comprehensive documentation for systems, processes, and architectural decisions
- Mentor team members and share expertise through technical presentations and training sessions
- Optimize data preprocessing and feature engineering pipelines for LLM training and inference
- Implement data validation, quality checks, and lineage tracking for model training datasets
- Design efficient data storage and retrieval systems for large-scale model artifacts and training data
- Implement model governance frameworks including audit trails, compliance monitoring, and approval workflows
- Ensure secure model deployment practices, access controls, and data privacy measures
- Identify and mitigate risks associated with LLM deployment and operations
- Maintain development, staging, and production environments for LLM workflows
Requirements
- Bachelor’s degree in Computer Science, Statistics, Engineering, or a related field (B.E/B.Tech/M.Tech), or equivalent
- LLMOps Engineer with software engineering experience
- 6-12 years of experience building production-quality software
- At least 5 years of experience in Python
- 6+ years of software development experience with strong programming skills in Python and SQL
- 2+ years of hands-on experience in LLMOps
- 1+ years of experience with machine learning operations, model deployment, and lifecycle management
- Proficiency with at least one major cloud provider (AWS or GCP) and its ML services
- Experience with Docker, Kubernetes, and container orchestration for ML workloads
- Strong experience in designing, building, and maintaining production-grade APIs for ML services
- Proficiency with Git, CI/CD pipelines, and DevOps practices
- Understanding of LLM architectures, training methodologies, and fine-tuning techniques
- Knowledge of ML pipeline design, model monitoring, and deployment strategies
- Understanding of distributed systems, scalability patterns, and microservices architecture
- Good to have: Experience with HuggingFace Transformers, PyTorch, TensorFlow, or similar frameworks
- Good to have: Knowledge of prompt optimization and RAG (Retrieval-Augmented Generation) architectures
- Good to have: Experience with vector search
- Note: Exceptional candidates without advanced degrees will be considered