
Principal MLOps Engineer
Red Cell Partners
full-time
Location Type: Remote
Location: Remote • Virginia, Washington • 🇺🇸 United States
Salary
💰 $200,000 - $250,000 per year
Job Level
Lead
Tech Stack
AWS, Azure, Cloud, Docker, Google Cloud Platform, Jenkins, Kubernetes, Python, Terraform
About the role
- Own the technical vision, strategy, and end-to-end architecture for Trase’s MLOps platform, ensuring scalability, reliability, security, and cost-efficiency.
- Architect and build a sophisticated CI/CD/CT ecosystem to automate the entire ML lifecycle, from data validation to production monitoring.
- Lead the design of scalable and resilient ML infrastructure using IaC (Terraform) and container orchestration (Kubernetes) on a major cloud platform.
- Establish MLOps best practices, including frameworks for version control, experiment tracking, model governance, and responsible AI.
- Implement a robust monitoring and alerting framework to track model performance, detect drift, and ensure the reliability of production ML services.
- Serve as the organization's thought leader on MLOps, mentoring engineers and driving cross-functional alignment on platform strategy and best practices.
- Define the multi-year roadmap for Trase’s MLOps ecosystem in alignment with business and product strategy.
- Anticipate emerging trends (LLMOps, AutoML, multi-cloud, federated learning) and guide the organization to adopt them proactively.
- Define patterns for operating large-scale LLMs and multi-modal AI in production with efficiency and compliance.
- Solve highly ambiguous, large-scale ML deployment challenges where no precedent exists, defining best practices for the org.
- Focus on model training, pipeline development, and fine-tuning of large language models (LLMs) to ensure peak performance.
- Some travel is required.
Requirements
- 10+ years in software/infrastructure engineering, with 5+ years in a senior/lead MLOps, ML Infrastructure, or Platform role.
- Expertise in designing and operating scalable, production-grade ML systems on AWS, GCP, or Azure.
- Mastery of Docker and Kubernetes for managing production ML workloads.
- Proven experience managing complex infrastructure as code (IaC) with tools like Terraform.
- Deep experience architecting CI/CD/CT pipelines for complex ML workflows (e.g., GitHub Actions, Jenkins).
- Strong Python programming skills for infrastructure automation, tooling, and services.
- Experience architecting solutions across the full ML lifecycle, from experiment tracking to advanced deployment patterns and monitoring.
- Exceptional communication skills to articulate complex architectural strategy to stakeholders at all levels.
- Familiarity with modern MLOps tools like MLflow, Kubeflow, SageMaker, or Vertex AI.
- Experience with the operational challenges of LLMs, including fine-tuning pipelines, RAG systems, and vector databases.
Benefits
- 100% employer-paid, comprehensive health care including medical, dental, and vision for you and your family.
- Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
- Unlimited PTO, with management approval.
- Opportunities for professional development and continued learning with educational reimbursements.
- Optional 401(k), FSA, and equity incentives available.
- Mental health benefits through TARA Mind.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
MLOps, ML infrastructure, CI/CD, IaC, Terraform, Kubernetes, Python, Docker, ML lifecycle, LLMs
Soft skills
communication, mentoring, leadership, cross-functional alignment, strategic thinking, problem-solving, adaptability, collaboration, stakeholder engagement, thought leadership