Site Reliability Engineer
Mistral AI. Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
Tech Stack
Tools & technologies: Cloud, Distributed Systems, Docker, Flux, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
- Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads.
- Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters.
- Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.).
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
- Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs.
- Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform.
- Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
- Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
- Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
- Document processes and procedures to ensure consistency and knowledge sharing across the team.
- Contribute to open-source projects, research publications, blog articles and conferences.
Requirements
What you’ll need
- Master’s degree in Computer Science, Engineering or a related field
- 7+ years of experience in a DevOps/SRE role
- Strong experience with cloud computing and highly available distributed systems
- Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
- Experience working against reliability KPIs (observability, alerting, SLAs)
- Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
- Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
- Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
- Strong understanding of networking, security, and system administration concepts
- Excellent problem-solving and communication skills
- Self-motivated and able to work well in a fast-paced startup environment
Your application will be all the more interesting if you also have:
- experience in an AI/ML environment
- experience with high-performance computing (HPC) systems and workload managers (Slurm)
- experience with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)
Benefits
Comp & perks
- 💰 Competitive salary and equity
- 🚑 Healthcare: Medical/Dental/Vision covered for you and your family
- 👴🏻 401K: 6% matching
- 🏝️ PTO: 18 days
- 🚗 Transportation: reimbursement of office parking charges, or $120/month for public transport
- 🏀 Sport: $120/month reimbursement for gym membership
- 🥕 Meal stipend: $400 monthly allowance for meals
- 🌎 Visa sponsorship
- 🤝 Coaching: we offer BetterUp coaching on a voluntary basis