
Senior ML Infrastructure – DevOps Engineer
Pathway
full-time
Location Type: Remote
Location: Remote • 🇫🇷 France
Job Level
Senior
Tech Stack
Airflow, AWS, Azure, Cloud, DNS, Docker, Google Cloud Platform, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, PyTorch, Shell Scripting, TensorFlow, Terraform
About the role
- Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management); a small sketch of this kind of work follows this list.
- Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management.
- Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
- Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services.
- Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
- Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges.
- Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
- Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break.
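As a hypothetical flavor of the cluster work above, here is a minimal sketch of queueing a single-GPU training job on Kubernetes with the official `kubernetes` Python client. The image name, namespace, and resource values are illustrative placeholders, not Pathway's actual setup.

```python
# Minimal sketch (assumptions: a reachable cluster with NVIDIA GPU nodes and a
# valid kubeconfig; image, namespace, and resources are hypothetical).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="trainer",
    image="registry.example.com/ml/train:latest",  # hypothetical training image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "cpu": "8", "memory": "32Gi"},
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-training-job"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry failed pods twice before marking the job failed
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)
```

In practice, submission logic like this would sit behind queueing, quota, and autoscaling tooling rather than being invoked by hand.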
Requirements
- Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads.
- Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.
What we are looking for
- Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
- Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
- Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
- Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
- Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
- Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
- Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer; a small example follows this list.
- Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment.
- Willingness to learn.
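To illustrate the expected level of Python/PyTorch fluency (reading and debugging rather than model research), here is a small, hypothetical example of the kind of device-mismatch issue you should be comfortable stepping through; the layer sizes and names are made up.

```python
# Hypothetical debugging target: model and batch must live on the same device.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(128, 10).to(device)
batch = torch.randn(32, 128)  # created on CPU by default

# Calling model(batch) directly raises "Expected all tensors to be on the same
# device" when CUDA is available; the fix is to move the batch first.
logits = model(batch.to(device))
print(logits.shape)  # torch.Size([32, 10])
```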
Benefits
- Intellectually stimulating work environment. Be a pioneer: you get to work with real-time data processing and AI.
- Work in one of the hottest AI startups, with exciting career prospects. Team members are distributed across the world.
- Responsibilities and the ability to make a significant contribution to the company's success
- Inclusive workplace culture
Further details
- **Type of contract**: Permanent employment contract
- **Preferred joining date**: Immediate
- **Compensation**: Based on profile and location
- **Location**: Remote work, with the possibility to work or meet with other team members in one of our offices: Palo Alto, CA; Paris, France; or Wroclaw, Poland. Candidates based anywhere in the EU, United States, and Canada will be considered.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
GPU clusters, CPU clusters, ML pipelines, infrastructure as code, CI/CD, shell scripting, workload management, containerization, programming in Python, monitoring and logging
Soft skills
strong ownership mindset, comfort with ambiguity, enthusiasm for scaling, lead incident response, collaboration with ML engineers