Scaleway

Site Reliability Engineer – HPC

Scaleway

full-time

Posted on:

Origin:  • 🇫🇷 France

Visit company website
AI Apply
Apply

Job Level

Mid-LevelSenior

Tech Stack

AnsibleChefCloudDNSGoGrafanaLinuxPostgresPrometheusPythonRustSaltStackTCP/IP

About the role

  • - Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents- Troubleshoot high-impact production issues in collaboration with other engineering teams
  • - Participate in an on-call rotation to handle incidents and ensure service continuity
  • - Implement and maintain observability solutions to monitor AI infrastructure and application health
  • - Contribute to AI infrastructure lifecycle management across different environments and countries
  • - Promote and apply best practices in terms of stability, resiliency, scalability, and security
  • - Maintain clear technical documentation for tools and procedures
  • - Contribute to system and tool evolution based on production feedback
  • - Collaborate closely with development teams to ensure infrastructure readiness- Participate in team rituals and knowledge-sharing initiatives

Requirements

  • - Experience with Go, Python or Rust
  • - Strong scripting skills (Bash, Python)
  • - Hands-on experience with Linux systems (Ubuntu/Debian)
  • - Hands-on experience with GPU & HPC infrastructure
  • - Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)
  • - Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
  • - Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
  • - Experience managing relational databases (PostgreSQL)
  • - Understanding of CI/CD pipelines (GitLab)
  • - Comfortable with English (written and spoken)