Brillio

LLMOps Engineer

full-time

Origin: 🇺🇸 United States • Florida

Salary

💰 $110,000 - $120,000 per year

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, SQL, TensorFlow

About the role

  • Design, implement, and maintain end-to-end pipelines for LLM training, fine-tuning, validation, and deployment
  • Build and optimize scalable infrastructure for large language model operations
  • Deploy LLMs to production environments with prompt management, observability, serverless deployment, proper monitoring, scaling, and performance optimization
  • Design, develop, and maintain RESTful API endpoints for LLM inference and model interactions
  • Ensure API reliability, performance optimization, rate limiting, authentication, and comprehensive documentation
  • Implement comprehensive monitoring solutions for model performance, drift detection, and system health metrics
  • Research and evaluate emerging LLMOps techniques, tools, and methodologies
  • Provide informed recommendations on technology choices, architecture decisions, and implementation strategies
  • Establish and document best practices for LLM operations, deployment patterns, and governance frameworks
  • Develop prototypes and POCs to validate new approaches and technologies
  • Work closely with data scientists, ML engineers, DevOps teams, and product managers
  • Create comprehensive documentation for systems, processes, and architectural decisions
  • Mentor team members and share expertise through technical presentations and training sessions
  • Optimize data preprocessing and feature engineering pipelines for LLM training and inference
  • Implement data validation, quality checks, and lineage tracking for model training datasets
  • Design efficient data storage and retrieval systems for large-scale model artifacts and training data
  • Implement model governance frameworks including audit trails, compliance monitoring, and approval workflows
  • Ensure secure model deployment practices, access controls, and data privacy measures
  • Identify and mitigate risks associated with LLM deployment and operations
  • Maintain development, staging, and production environments for LLM workflows

Requirements

  • Bachelor’s degree in Computer Science, Statistics, Engineering or a related field (exceptional candidates without advanced degrees will be considered)
  • B.E/B.Tech/M.Tech in Computer Science or a related technical degree, or equivalent
  • 6-12 years of experience building production-quality software (at least 5 years in Python), plus 2 years in LLMOps
  • 6+ years of software development experience with strong programming skills in Python and SQL
  • 2+ years of hands-on experience with LLMOps
  • 1+ years of experience with machine learning operations, model deployment, and lifecycle management
  • Proficiency with at least one major cloud provider (AWS or GCP) and its ML services
  • Experience with Docker, Kubernetes, and container orchestration for ML workloads
  • Strong experience in designing, building, and maintaining production-grade APIs for ML services
  • Proficiency with Git, CI/CD pipelines, and DevOps practices
  • Understanding of LLM architectures, training methodologies, and fine-tuning techniques
  • Knowledge of ML pipeline design, model monitoring, and deployment strategies
  • Understanding of distributed systems, scalability patterns, and microservices architecture
  • Good-to-Have: Experience with HuggingFace Transformers, PyTorch, TensorFlow, or similar frameworks
  • Good-to-Have: Knowledge of prompt optimization and RAG (Retrieval-Augmented Generation) architectures
  • Good-to-Have: Experience with vector search