SECURENTITY

Platform Engineer, AI/ML Infrastructure

full-time

Origin: 🇺🇸 United States

Salary

💰 $160,000 - $220,000 per year

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Go, Jenkins, Kubernetes, Python, Terraform

About the role

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the lifecycle of single-tenant, managed deployments.
  • You are passionate about building platforms that empower developers and researchers.
  • You enjoy creating elegant, automated solutions for complex infrastructure challenges in both cloud and data center environments.
  • You thrive on optimizing hybrid infrastructure for performance, cost, and reliability.
  • You are excited to work at the intersection of modern platform engineering and cutting-edge AI.
  • You love to treat infrastructure as a product, continuously improving the developer experience.
  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, Argo CD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.

Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).