DevOps Engineer

• Architect, automate, and scale the infrastructure for large-scale model training and research workflows.
• Design and run large-scale pre-training experiments for both dense and mixture-of-experts (MoE) architectures.
• Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments.
• Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
• Build systems that reduce operational toil and establish guardrails for researchers.
• Build and own CI/CD pipelines, with rollback capabilities, for training workflows, evaluation jobs, internal tools, and services.
• Develop tooling for developer workflows, including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
• Create self-service infrastructure patterns that empower researchers and engineers.
• Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
• Manage and extend HPC environments, including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration.
• Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
• Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
• Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups.
• Implement comprehensive monitoring, logging, and alerting across all infrastructure layers.
• Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
• Build observability stacks for system health and job-level performance.
• Proactively detect and resolve infrastructure issues before they impact research workflows.
• Implement and operate secrets management and identity security solutions.
• Champion security best practices, IAM policies, and compliance standards.
• Document best practices, create runbooks, and evangelize DevOps culture across the organization.
• Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.
