Evolve our cloud infrastructure (AWS & GCP) using infrastructure-as-code tools like Terraform or Ansible.
Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
Manage scalable, secure, and cost-efficient environments across dev, staging, and production.
Participate in an on-call rotation.
Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment stages.
Implement infrastructure to support versioning, orchestration, and monitoring of ML models in production (e.g., Kubeflow, SageMaker, Vertex AI, or custom pipelines).
Optimize data pipelines and model serving infrastructure for low-latency and high-throughput performance.
Drive the strategy for observability, logging, and alerting across distributed systems.
Lead incident response, root cause analysis, and system hardening for long-term resiliency.
Implement best practices for infrastructure security, container hardening, and network architecture.
Partner with engineering teams to bake DevOps best practices into the development lifecycle.
Build tooling and automation that improves developer velocity, release stability, and system transparency.

Requirements

5+ years of experience in DevOps, SRE, or platform engineering roles in high-growth environments.
3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
Strong experience with infrastructure-as-code (Terraform, Ansible) and CI/CD tooling (GitHub Actions, ArgoCD, etc.).
Experience supporting machine learning teams or MLOps platforms (e.g., model training pipelines, feature stores, model registries, online inference).
Strong knowledge of container orchestration (Kubernetes preferred) and observability stacks (Prometheus, Grafana, Sentry, Datadog, New Relic, etc.).
Proven ability to participate in architectural conversations and contribute to large-scale infrastructure improvements.
A bias toward simplicity, security, and reliability — you know when to build fast and when to build right.
Familiarity with at least one programming language, such as Python, Go, Erlang, or Rust.
Exposure to agentic programming workflows.
RHCE, RHCSA, or equivalent certifications preferred.

Benefits

Competitive salary
Remote work environment
Medical, dental, and vision benefits
401(k) plan with employer match
Flexible PTO
Opportunity to build at the ground floor of a high-growth, mission-driven company

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AWS, GCP, Terraform, Ansible, Kubernetes, CI/CD, Python, Go, ECS, EKS

Soft Skills

collaboration, incident response, root cause analysis, system hardening, observability, security, reliability, simplicity, developer velocity, release stability

Certifications

RHCE, RHCSA