Wormhole

Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇺🇸 United States


Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Distributed Systems, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Rust, Splunk

About the role

  • Act as first responder and incident commander during production incidents
  • Lead incident triage, root cause analysis, and retrospective documentation
  • Build detailed incident timelines and preventative runbooks
  • Respond to incidents related to: performance issues, CCQ failures or degraded throughput, observability pipeline outages, and core Wormhole products
  • Deliver remediation recommendations and implement approved fixes
  • Improve reliability and uptime across all Wormhole services
  • Strengthen observability, monitoring, and alerting systems
  • Harden infrastructure for security and operational resiliency
  • Enhance deployment workflows and reduce operational friction
  • Lead incident response, analysis, and continuous improvement
  • Support operational tooling used by engineering, DevOps, and validator partners

Requirements

  • Relevant tertiary qualifications in computer science or a closely related field (bachelor's/master's) and/or at least five years of relevant work experience
  • Established experience as an incident commander working with multiple stakeholders in a global team
  • Familiarity with metrics and log analysis tools (e.g., Grafana), incident response tools (e.g., PagerDuty), GitHub administration and related tools
  • Deep understanding of reliability engineering, observability, and incident response for distributed systems
  • Ability to write and debug code in any of the following: Go, Rust, Java
  • Strong experience operating Grafana, Datadog, or Splunk, and/or Kubernetes, in production environments
  • Experience securing distributed systems and public-facing infrastructure
  • Ability to operate independently, document clearly, and lead during incidents
  • Solid understanding of cloud computing environments (AWS and GCP preferred) and willingness to keep up to date with their changing offerings
  • Excellent and proactive written and verbal communication
  • Ideal candidate will be based in the ET or GMT time zone or be able to work those hours

Applicant Tracking System Keywords

Hard skills
incident response, root cause analysis, reliability engineering, observability, Go, Rust, Java, cloud computing, Kubernetes, log analysis
Soft skills
leadership, communication, independence, documentation, proactive, collaboration, problem-solving, analytical thinking, continuous improvement, incident command