DataCrunch

Senior – Principal Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇪🇺 Anywhere in Europe


Job Level

Senior

Tech Stack

Ansible, AWS, Azure, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Linux, Python, Terraform

About the role

  • Ensure the reliability, scalability, and performance of HPC and cloud systems
  • Build and maintain automation, observability, and monitoring frameworks for compute clusters
  • Collaborate with ML, data, and infrastructure teams to deliver high-availability systems
  • Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes
  • Participate in architecture design and long-term infrastructure strategy discussions
  • Participate in a 24/7 on-call rotation, with at least one full on-call week per month

Requirements

  • 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems
  • Linux expertise (Ubuntu or Debian preferred)
  • Strong experience with scripting and automation (Python, Go, Bash)
  • Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius)
  • Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible)
  • Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs

Benefits

  • Generous cash + equity compensation
  • Various fringe benefits (e.g., healthcare, lunch, wellbeing)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

SRE, DevOps, Infrastructure Engineering, Linux, Python, Go, Bash, AWS, GCP, Azure

Soft skills

collaboration, communication, problem-solving, reliability, scalability, performance, automation, observability, monitoring, architecture design