Mistral AI

Site Reliability Engineer, Technical Lead

Full-time

Location Type: Hybrid

Location: Paris • 🇫🇷 France

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Docker, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

  • Lead Site Reliability Engineer responsible for driving the infrastructure team and reporting to the Head of Engineering.
  • Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; plan projects and allocate tasks.
  • Collaborate with stakeholders across engineering, science, and product management.
  • Design, build, and maintain scalable, highly available, fault-tolerant infrastructures for web services and ML workloads.
  • Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
  • Operate production systems: troubleshooting, on-call responses, user admin, data extraction, infrastructure scaling; perform root cause analyses.
  • Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
  • Implement and maintain CI/CD, containerization, orchestration, monitoring, logging and alerting workflows for client-facing APIs and large training runs.
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform.
  • Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
  • Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
  • Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.

Requirements

  • 10+ years of experience in a DevOps/SRE role.
  • Experience with building and leading high-performing teams.
  • Experience with cloud computing and highly available distributed systems.
  • Exposure to site reliability issues in critical environments (root cause analysis, in-production troubleshooting, on-call rotations).
  • Experience working against reliability KPIs (observability, alerting, SLAs).
  • Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
  • Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
  • Experience with infrastructure-as-code tools (Terraform, CloudFormation).
  • Proficiency in programming and scripting languages (Python, Go, Bash).
  • Understanding of networking, security, and system administration concepts.
  • Excellent problem-solving and communication skills.
  • Self-motivated and able to work well in a fast-paced startup environment.
  • Willingness to reside in or relocate to Paris or London (candidates in France and the UK may be considered for remote work but must visit the office during onboarding and at least 3 days per month afterward).

Benefits

  • 💰 Competitive salary and equity
  • 🧑‍⚕️ Health insurance
  • 🚴 Transportation allowance
  • 🥎 Sport allowance
  • 🥕 Meal vouchers
  • 💰 Private pension plan
  • 🍼 Generous parental leave policy
  • 🌎 Visa sponsorship
  • Accommodation and travel covered for the first month of onboarding
  • Requirement to visit the local office at least 3 days per month (after onboarding)

Applicant Tracking System Keywords

Hard skills
DevOps, Site Reliability Engineering, cloud computing, distributed systems, CI/CD, containerization, orchestration, scripting languages, infrastructure-as-code, networking
Soft skills
problem-solving, communication, team leadership, self-motivated, collaboration, project planning, task allocation, performance improvement, stakeholder engagement, adaptability