
Site Reliability Engineer, Technical Lead
Mistral AI
full-time
Location Type: Hybrid
Location: Paris • 🇫🇷 France
Job Level
Senior
Tech Stack
Cloud, Distributed Systems, Docker, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
- Lead Site Reliability Engineer responsible for driving the infrastructure team and reporting to the Head of Engineering.
- Empower and supervise the SRE team: remove obstacles, hire, onboard, and elevate team performance; project planning and task allocation.
- Collaborate with stakeholders across engineering, science, and product management.
- Design, build, and maintain scalable, highly available, fault-tolerant infrastructures for web services and ML workloads.
- Ensure platform, inference, and model training environments are highly available and reproducible across HPC clusters.
- Operate production systems: troubleshooting, on-call response, user administration, data extraction, and infrastructure scaling; perform root cause analyses.
- Implement and improve monitoring, alerting, and incident response systems to minimize downtime.
- Implement and maintain CI/CD, containerization, orchestration, monitoring, logging and alerting workflows for client-facing APIs and large training runs.
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform.
- Collaborate with AI/ML researchers to enable safe and reproducible model-training experiments and build a cloud-agnostic platform abstraction layer.
- Design and develop workflows, tooling, APIs, dashboards and automation to improve reliability and performance.
- Collaborate with security to ensure best practices and compliance; document processes and contribute to open-source and publications.
Requirements
- 10+ years of experience in a DevOps/SRE role.
- Experience with building and leading high-performing teams.
- Experience with cloud computing and highly available distributed systems.
- Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations).
- Experience working against reliability KPIs (observability, alerting, SLAs).
- Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes, Flux).
- Experience with monitoring, logging and observability tools (Prometheus, Grafana, ELK Stack, Datadog).
- Experience with infrastructure-as-code tools (Terraform, CloudFormation).
- Proficiency in programming and scripting languages (Python, Go, Bash).
- Understanding of networking, security, and system administration concepts.
- Excellent problem-solving and communication skills.
- Self-motivated and able to work well in a fast-paced startup environment.
- Willingness to reside in or relocate to Paris or London (candidates in France & UK may be considered remotely but must visit office during onboarding and monthly).
Benefits
- 💰 Competitive salary and equity
- 🧑‍⚕️ Health insurance
- 🚴 Transportation allowance
- 🥎 Sport allowance
- 🥕 Meal vouchers
- 💰 Private pension plan
- 🍼 Generous parental leave policy
- 🌎 Visa sponsorship
- Accommodation and travelling covered for the first month of onboarding
- Requirement to visit local office at least 3 days per month (after onboarding)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
DevOps, Site Reliability Engineering, cloud computing, distributed systems, CI/CD, containerization, orchestration, scripting languages, infrastructure-as-code, networking
Soft skills
problem-solving, communication, team leadership, self-motivated, collaboration, project planning, task allocation, performance improvement, stakeholder engagement, adaptability