Mistral AI

Site Reliability Engineer

Posted 4/21/2026 • Full-time • Remote • New York • 🇺🇸 United States • Senior, Lead

Tech Stack

Tools & technologies
Cloud, Distributed Systems, Docker, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
  • Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads.
  • Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters.
  • Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
  • Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
  • Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs.
  • Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
  • Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
  • Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
  • Design and develop new workflows and tooling to improve the reliability, availability, and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
  • Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
  • Document processes and procedures to ensure consistency and knowledge sharing across the team.
  • Contribute to open-source projects, research publications, blog articles and conferences.

Requirements

What you’ll need
  • Master’s degree in Computer Science, Engineering or a related field
  • 7+ years of experience in a DevOps/SRE role
  • Strong experience with cloud computing and highly available distributed systems
  • Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
  • Experience working against reliability KPIs (observability, alerting, SLAs)
  • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
  • Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
  • Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
  • Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
  • Strong understanding of networking, security, and system administration concepts
  • Excellent problem-solving and communication skills
  • Self-motivated and able to work well in a fast-paced startup environment
Your application will be all the more interesting if you also have:
  • Experience in an AI/ML environment
  • Experience with high-performance computing (HPC) systems and workload managers (Slurm)
  • Experience with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)

Benefits

Comp & perks
  • 💰 Competitive salary and equity
  • 🚑 Healthcare: Medical/Dental/Vision covered for you and your family
  • 👴🏻 401(k): 6% matching
  • 🏝️ PTO: 18 days
  • 🚗 Transportation: reimbursement of office parking charges, or $120/month for public transport
  • 🏀 Sport: $120/month reimbursement for gym membership
  • 🥕 Meal stipend: $400 monthly allowance for meals
  • 🌎 Visa sponsorship
  • 🤝 Coaching: we offer BetterUp coaching on a voluntary basis
