
Senior – Principal Site Reliability Engineer
DataCrunch
full-time
Location Type: Remote
Location: Remote • 🇪🇺 Anywhere in Europe
Job Level
Senior
Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Linux, Python, Terraform
About the role
- Ensure the reliability, scalability, and performance of HPC and cloud systems
- Build and maintain automation, observability, and monitoring frameworks for compute clusters
- Collaborate with ML, data, and infrastructure teams to deliver high-availability systems
- Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes
- Participate in architecture design and long-term infrastructure strategy discussions
- Participate in a 24/7 on-call rotation, with at least one full on-call week per month
Requirements
- 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems
- Linux expertise (Ubuntu or Debian preferred)
- Strong experience with scripting and automation (Python, Go, Bash)
- Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius)
- Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible)
- Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs
Benefits
- Generous cash + equity compensation
- Various fringe benefits (e.g., healthcare, lunch, wellbeing)
Applicant Tracking System Keywords
Hard skills
SRE, DevOps, Infrastructure Engineering, Linux, Python, Go, Bash, AWS, GCP, Azure
Soft skills
collaboration, communication, problem-solving, reliability, scalability, performance, automation, observability, monitoring, architecture design