Site Reliability Engineer, Technical Lead

Mistral AI

full-time

Posted on: 10/1/2025

Location Type: Hybrid

Location: Paris • 🇫🇷 France

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Docker, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

Lead Site Reliability Engineer responsible for driving the infrastructure team and reporting to the Head of Engineering.
Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; plan projects and allocate tasks.
Collaborate with stakeholders across engineering, science, and product management.
Design, build, and maintain scalable, highly available, fault-tolerant infrastructures for web services and ML workloads.
Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
Operate production systems: troubleshooting, on-call response, user administration, data extraction, and infrastructure scaling; perform root cause analyses.
Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
Implement and maintain CI/CD, containerization, orchestration, monitoring, logging and alerting workflows for client-facing APIs and large training runs.
Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.

Requirements

10+ years of experience in a DevOps/SRE role.
Experience with building and leading high-performing teams.
Experience with cloud computing and highly available distributed systems.
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
Experience working against reliability KPIs (observability, alerting, SLAs).
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
Experience with infrastructure-as-code tools (Terraform, CloudFormation).
Proficiency in scripting languages (Python, Go, Bash).
Understanding of networking, security, and system administration concepts.
Excellent problem-solving and communication skills.
Self-motivated and able to work well in a fast-paced startup environment.
Willingness to reside in or relocate to Paris or London (candidates in France and the UK may be considered for remote work but must visit the office during onboarding and monthly thereafter).

Benefits

💰 Competitive salary and equity
🧑‍⚕️ Health insurance
🚴 Transportation allowance
🥎 Sport allowance
🥕 Meal vouchers
💰 Private pension plan
🍼 Generous parental leave policy
🌎 Visa sponsorship
Accommodation and travel covered for the first month of onboarding
Requirement to visit local office at least 3 days per month (after onboarding)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

DevOps, Site Reliability Engineering, cloud computing, distributed systems, CI/CD, containerization, orchestration, scripting languages, infrastructure-as-code, networking

Soft skills

problem-solving, communication, team leadership, self-motivated, collaboration, project planning, task allocation, performance improvement, stakeholder engagement, adaptability