Hatch

Senior Cloud Infrastructure Engineer

full-time

Location Type: Remote

Location: United States

Salary

💰 $158,000 - $216,000 per year

About the role

  • Evolve our cloud infrastructure (AWS & GCP) using infrastructure-as-code tools like Terraform or Ansible.
  • Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
  • Manage scalable, secure, and cost-efficient environments across dev, staging, and production.
  • Participate in an on-call rotation.
  • Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment stages.
  • Implement infrastructure to support versioning, orchestration, and monitoring of ML models in production (e.g., using tools like Kubeflow, SageMaker, Vertex AI, or custom pipelines).
  • Optimize data pipelines and model serving infrastructure for low-latency and high-throughput performance.
  • Drive the strategy for observability, logging, and alerting across distributed systems.
  • Lead incident response, root cause analysis, and system hardening for long-term resiliency.
  • Implement best practices for infrastructure security, container hardening, and network architecture.
  • Partner with engineering teams to bake DevOps best practices into the development lifecycle.
  • Build tooling and automation that improves developer velocity, release stability, and system transparency.

Requirements

  • 5+ years of experience in DevOps, SRE, or platform engineering roles in high-growth environments.
  • 3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
  • Strong experience with infrastructure-as-code (Terraform, Ansible) and CI/CD tooling (GitHub Actions, Argo CD, etc.).
  • Experience supporting machine learning teams or MLOps platforms (e.g. model training pipelines, feature stores, model registry, online inference).
  • Strong knowledge of container orchestration (Kubernetes preferred) and observability stacks (Prometheus, Grafana, Sentry, Datadog, New Relic, etc.).
  • Proven ability to participate in architectural conversations and contribute to large-scale infrastructure improvements.
  • A bias toward simplicity, security, and reliability — you know when to build fast and when to build right.
  • Familiarity with at least one programming language, such as Python, Go, Erlang, or Rust.
  • Exposure to agentic programming workflows.
  • RHCE, RHCSA, or equivalent certifications preferred.

Benefits

  • Competitive salary
  • Remote work environment
  • Medical, dental, and vision benefits
  • 401(k) plan with match
  • Flexible PTO
  • Opportunity to build at the ground floor of a high-growth, mission-driven company

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AWS, GCP, Terraform, Ansible, Kubernetes, CI/CD, Python, Go, ECS, EKS
Soft Skills
collaboration, incident response, root cause analysis, system hardening, observability, security, reliability, simplicity, developer velocity, release stability
Certifications
RHCE, RHCSA