
Senior Cloud Infrastructure Engineer
Hatch
full-time
Location Type: Remote
Location: United States
Salary
💰 $158,000 - $216,000 per year
About the role
- Evolve our cloud infrastructure (AWS & GCP) using infrastructure-as-code tools like Terraform or Ansible.
- Implement systems that support the compute-heavy and storage-intensive needs of machine learning and data processing pipelines.
- Manage scalable, secure, and cost-efficient environments across dev, staging, and production.
- Participate in an on-call rotation.
- Collaborate with ML engineers to productionize models and manage workflows across training, testing, and deployment stages.
- Implement infrastructure to support versioning, orchestration, and monitoring of ML models in production (e.g., using tools like Kubeflow, SageMaker, Vertex AI, or custom pipelines).
- Optimize data pipelines and model serving infrastructure for low-latency and high-throughput performance.
- Drive the strategy for observability, logging, and alerting across distributed systems.
- Lead incident response, root cause analysis, and system hardening for long-term resiliency.
- Implement best practices for infrastructure security, container hardening, and network architecture.
- Partner with engineering teams to bake DevOps best practices into the development lifecycle.
- Build tooling and automation that improves developer velocity, release stability, and system transparency.
Requirements
- 5+ years of experience in DevOps, SRE, or platform engineering roles in high-growth environments.
- 3+ years of experience with AWS infrastructure and services, including networking, IAM, ECS/EKS, and serverless computing.
- Strong experience with infrastructure-as-code (Terraform, Ansible) and CI/CD tooling (GitHub Actions, ArgoCD, etc.).
- Experience supporting machine learning teams or MLOps platforms (e.g. model training pipelines, feature stores, model registry, online inference).
- Strong knowledge of container orchestration (Kubernetes preferred) and observability stacks (Prometheus, Grafana, Sentry, Datadog, New Relic, etc.).
- Proven ability to participate in architectural conversations and contribute to large-scale infrastructure improvements.
- A bias toward simplicity, security, and reliability: you know when to build fast and when to build right.
- Familiarity with at least one programming language, such as Python, Go, Erlang, or Rust.
- Exposure to agentic programming workflows.
- RHCE, RHCSA, or equivalent certifications preferred.
Benefits
- Competitive salary
- Remote work environment
- Medical, dental, and vision benefits
- 401(k) plan with company match
- Flexible PTO
- Opportunity to build at the ground floor of a high-growth, mission-driven company
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, GCP, Terraform, Ansible, Kubernetes, CI/CD, Python, Go, ECS, EKS
Soft Skills
collaboration, incident response, root cause analysis, system hardening, observability, security, reliability, simplicity, developer velocity, release stability
Certifications
RHCE, RHCSA