Implement comprehensive monitoring, logging, and alerting across all infrastructure layers.
Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
Build observability stacks for system health and job-level performance.
Proactively detect and resolve infrastructure issues before they impact research workflows.
Implement and manage secrets management and identity security solutions.
Champion security best practices, IAM policies, and compliance standards.
Document best practices, create runbooks, and evangelize DevOps culture across the organization.
Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.
Requirements
Bachelor's or Master's degree in Computer Science, Engineering, or related field.
6-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based).
Deep Unix/Linux administration expertise including kernel tuning, networking, storage, and process control.
Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management.
Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
Strong monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
Demonstrated success scaling infrastructure for high-performance or GPU workloads.
Track record of managing GPU-accelerated clusters or HPC infrastructure.
Experience automating workflows to reduce toil and scaling deployments safely.
Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency.
Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
Entrepreneurial and mission-driven, comfortable in a fast-growing, startup-style environment and motivated by the ambition of tackling one of the greatest scientific challenges in history.
Demonstrated passion for physics and for making scientific knowledge accessible and impactful.
Benefits
Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.