
Site Reliability Engineer
Wormhole
full-time
Location Type: Remote
Location: Remote • 🇺🇸 United States
Job Level
Mid-Level, Senior
Tech Stack
AWS, Cloud, Distributed Systems, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Rust, Splunk
About the role
- Act as first responder and incident commander during production incidents
- Lead incident triage, root cause analysis, and retrospective documentation
- Build detailed incident timelines and preventative runbooks
- Respond to incidents involving performance issues, CCQ failures or degraded throughput, observability pipeline outages, and core Wormhole products
- Deliver remediation recommendations and implement approved fixes
- Improve reliability and uptime across all Wormhole services
- Strengthen observability, monitoring, and alerting systems
- Harden infrastructure for security and operational resiliency
- Enhance deployment workflows and reduce operational friction
- Lead incident response, analysis, and continuous improvement
- Support operational tooling used by engineering, DevOps, and validator partners
Requirements
- Relevant tertiary qualification in computer science or a closely related field (bachelor's or master's) and/or at least five years of relevant work experience
- Established experience as an incident commander coordinating multiple stakeholders in a global team
- Familiarity with metrics and log analysis tools (e.g., Grafana), incident response tools (e.g., PagerDuty), GitHub administration, and related tools
- Deep understanding of reliability engineering, observability, and incident response for distributed systems
- Ability to write and debug code in any of the following: Go, Rust, Java
- Strong experience operating Grafana, Datadog, or Splunk, and/or Kubernetes, in production environments
- Experience securing distributed systems and public-facing infrastructure
- Ability to operate independently, document clearly, and lead during incidents
- Solid understanding of cloud computing environments (AWS and GCP preferred) and willingness to keep up to date with their changing offerings
- Excellent and proactive written and verbal communication
- Ideally based in the ET or GMT time zone, or able to work those hours
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
incident response, root cause analysis, reliability engineering, observability, Go, Rust, Java, cloud computing, Kubernetes, log analysis
Soft skills
leadership, communication, independence, documentation, proactive, collaboration, problem-solving, analytical thinking, continuous improvement, incident command