SRE Lead

GetBlock

full-time

Posted on: 1/6/2026

Location Type: Remote

About the role

Lead and grow the SRE team: hiring, onboarding, 1:1s, performance reviews, and career development.
Own SRE operating cadence: prioritization, planning, execution, and visibility of reliability work.
Maintain high standards for production readiness: runbooks, operational checklists, change management, and quality gates.
Own production reliability end-to-end across gateways, clusters, and blockchain node fleets.
Define and evolve SLIs/SLOs for uptime, response time, RPS, and time-to-resolve; partner with engineering teams to meet targets.
Own incident management standards: alerting strategy, escalation, incident coordination, and communications.
Run and improve postmortems: ensure follow-ups are executed and reliability debt is reduced over time.
Lead capacity planning and performance work across regions and chains; balance reliability, speed, and cost.
Lead design reviews and set engineering standards for reliability, scalability, and operational excellence.
Drive architecture decisions across Nomad and Kubernetes environments, gateways, and the observability stack.
Build and evolve internal tooling that improves reliability and operational efficiency (automation, health systems, diagnostics, self-service).

Requirements

3+ years in SRE / infrastructure / production engineering, including 1+ year leading people
Strong Linux, networking, and production incident debugging skills
Experience running and scaling distributed, multi-region, high-load systems
Hands-on with orchestration (Nomad and/or Kubernetes) and modern gateways/proxies
Solid observability practices (metrics, logs, traces, alerting, incident response)
Experience using AI agents to improve operational efficiency and reliability automation
Strong communication and ability to lead technical decisions end to end

Nice to have

Web3 / RPC infrastructure and blockchain node operations
HashiCorp stack (Nomad, Consul, Vault), Prometheus ecosystem
Terraform / IaC, capacity & cost modeling, DDoS and abuse protection
Building internal platforms: self-service tools, runbooks, reliability automation

Benefits

20 days of annual leave, plus 12 additional days off for holidays or personal days.
Well-being programs to support your health and balance.
Coworking space compensation for a productive work environment.
Paid sick leave to ensure you can rest when needed.
A company that invests in your growth, with personalized roadmaps to guide your professional development.
An actively growing company with great opportunities for both horizontal and vertical career development.
Opportunity to shape the initiatives you’re working on and make a real impact.
