
SRE Lead
GetBlock
Full-time
Location Type: Remote
Location: United States
About the role
- Lead and grow the SRE team: hiring, onboarding, 1:1s, performance reviews, and career development.
- Own SRE operating cadence: prioritization, planning, execution, and visibility of reliability work.
- Maintain high standards for production readiness: runbooks, operational checklists, change management, and quality gates.
- Own production reliability end-to-end across gateways, clusters, and blockchain node fleets.
- Define and evolve SLIs/SLOs for uptime, response time, RPS, and time-to-resolve; partner with engineering teams to meet targets.
- Own incident management standards: alerting strategy, escalation, incident coordination, and communications.
- Run and improve postmortems: ensure follow-ups are executed and reliability debt is reduced over time.
- Lead capacity planning and performance work across regions and chains; balance reliability, speed, and cost.
- Lead design reviews and set engineering standards for reliability, scalability, and operational excellence.
- Drive architecture decisions across Nomad and Kubernetes environments, gateways, and the observability stack.
- Build and evolve internal tooling that improves reliability and operational efficiency (automation, health systems, diagnostics, self-service).
Requirements
- 3+ years in SRE / infrastructure / production engineering, including 1+ year leading people
- Strong Linux, networking, and production incident debugging skills
- Experience running and scaling distributed, multi-region, high-load systems
- Hands-on with orchestration (Nomad and/or Kubernetes) and modern gateways/proxies
- Solid observability practices (metrics, logs, traces, alerting, incident response)
- Experience using AI agents to improve operational efficiency and reliability automation
- Strong communication and ability to lead technical decisions end to end
Nice to have
- Web3 / RPC infrastructure and blockchain node operations
- HashiCorp stack (Nomad, Consul, Vault), Prometheus ecosystem
- Terraform / IaC, capacity & cost modeling, DDoS and abuse protection
- Building internal platforms: self-service tools, runbooks, reliability automation
Benefits
- 20 days of annual leave, plus 12 additional days off for holidays or personal days.
- Well-being programs to support your health and balance.
- Coworking space compensation for a productive work environment.
- Paid sick leave to ensure you can rest when needed.
- A company that invests in your growth, with personalized roadmaps to guide your professional development.
- An actively growing company with great opportunities for both horizontal and vertical career development.
- Opportunity to shape the initiatives you’re working on and make a real impact.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
SRE, Linux, networking, production incident debugging, orchestration, Nomad, Kubernetes, observability, Terraform, reliability automation
Soft skills
leadership, communication, performance reviews, career development, incident management, capacity planning, technical decision making