FirstPrinciples Holding Company

Member of Technical Staff – DevOps, Infrastructure Engineering

full-time

Location Type: Remote

Location: Remote • 🌎 Anywhere in the World

Job Level

Lead

Tech Stack

Ansible, AWS, Chef, Cloud, Docker, EC2, Go, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, Rust, SaltStack, Terraform, Unix

About the role

  • Architect, automate, and scale the infrastructure for large-scale model training and research workflows.
  • Design and run large-scale pre-training experiments for both dense and MoE architectures.
  • Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments.
  • Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
  • Build systems that reduce operational toil and establish guardrails for researchers.
  • Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities.
  • Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
  • Create self-service infrastructure patterns that empower researchers and engineers.
  • Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
  • Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration.
  • Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
  • Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
  • Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups.
  • Implement comprehensive monitoring, logging, and alerting across all infrastructure layers.
  • Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
  • Build observability stacks for system health and job-level performance.
  • Proactively detect and resolve infrastructure issues before they impact research workflows.
  • Implement and manage secrets management and identity security solutions.
  • Champion security best practices, IAM policies, and compliance standards.
  • Document best practices, create runbooks, and evangelize DevOps culture across the organization.
  • Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • 6-10+ years in DevOps, Infrastructure, or SRE roles, with proven hands-on systems engineering experience (not certifications alone).
  • Deep Unix/Linux administration expertise including kernel tuning, networking, storage, and process control.
  • Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
  • Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
  • Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management.
  • Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
  • Strong monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
  • Demonstrated success scaling infrastructure for high-performance or GPU workloads.
  • Track record of managing GPU-accelerated clusters or HPC infrastructure.
  • Experience automating workflows to reduce toil and scaling deployments safely.
  • Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency.
  • Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
  • Entrepreneurial & mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
  • Demonstrated passion for physics and for making scientific knowledge accessible and impactful.

Benefits

  • Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.