GetBlock

SRE Lead

Full-time

Location Type: Remote

Location: United States

About the role

  • Lead and grow the SRE team: hiring, onboarding, 1:1s, performance reviews, and career development.
  • Own SRE operating cadence: prioritization, planning, execution, and visibility of reliability work.
  • Maintain high standards for production readiness: runbooks, operational checklists, change management, and quality gates.
  • Own production reliability end-to-end across gateways, clusters, and blockchain node fleets.
  • Define and evolve SLIs/SLOs for uptime, response time, RPS, and time-to-resolve; partner with engineering teams to meet targets.
  • Own incident management standards: alerting strategy, escalation, incident coordination, and communications.
  • Run and improve postmortems: ensure follow-ups are executed and reliability debt is reduced over time.
  • Lead capacity planning and performance work across regions and chains; balance reliability, speed, and cost.
  • Lead design reviews and set engineering standards for reliability, scalability, and operational excellence.
  • Drive architecture decisions across Nomad and Kubernetes environments, gateways, and the observability stack.
  • Build and evolve internal tooling that improves reliability and operational efficiency (automation, health systems, diagnostics, self-service).

Requirements

  • 3+ years in SRE / infrastructure / production engineering, including 1+ year leading people
  • Strong Linux, networking, and production incident debugging skills
  • Experience running and scaling distributed, multi-region, high-load systems
  • Hands-on with orchestration (Nomad and/or Kubernetes) and modern gateways/proxies
  • Solid observability practices (metrics, logs, traces, alerting, incident response)
  • Experience using AI agents to improve operational efficiency and reliability automation
  • Strong communication and ability to lead technical decisions end to end

Nice to have

  • Web3 / RPC infrastructure and blockchain node operations
  • HashiCorp stack (Nomad, Consul, Vault), Prometheus ecosystem
  • Terraform / IaC, capacity & cost modeling, DDoS and abuse protection
  • Building internal platforms: self-service tools, runbooks, reliability automation

Benefits

  • 20 days of annual leave, plus 12 additional days off for holidays or personal days.
  • Well-being programs to support your health and balance.
  • Coworking space compensation for a productive work environment.
  • Paid sick leave to ensure you can rest when needed.
  • A company that invests in your growth, with personalized roadmaps to guide your professional development.
  • An actively growing company with great opportunities for both horizontal and vertical career development.
  • Opportunity to shape the initiatives you’re working on and make a real impact.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
SRE, Linux, networking, production incident debugging, orchestration, Nomad, Kubernetes, observability, Terraform, reliability automation
Soft skills
leadership, communication, performance reviews, career development, incident management, capacity planning, technical decision making