GE Vernova

SRE Observability SLO Engineer

full-time

Location Type: Hybrid

Location: Remote, United States

About the role

  • Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
  • Implement metrics collection for Kubernetes-hosted services (EKS/Rancher) including pod-level, namespace-level, and cluster-level metrics.
  • Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.
  • Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
  • Build and maintain SLO tooling — error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
  • Design and build operational dashboards covering availability, latency, error rates, and saturation (the 'Golden Signals') for every GridOS SaaS product.
  • Create executive-level dashboards for SRE leadership and customer-facing uptime/availability reports aligned to contractual SLAs.
  • Conduct periodic observability health reviews to identify gaps in coverage, reduce MTTD (Mean Time to Detect), and improve MTTR (Mean Time to Resolve).

Requirements

  • 2–3 years of experience in SRE, observability engineering, or infrastructure reliability roles.
  • Deep expertise with at least one major observability platform — Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
  • Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
  • Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
  • Experience with Kubernetes observability — kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
  • Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
  • Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
  • Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.

Benefits

  • Relocation Assistance Provided

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
observability engineering, SRE, Kubernetes, SLIs, SLOs, error budget alerting, distributed systems telemetry, structured logging, distributed tracing, scripting in Python
Soft Skills
collaboration, communication, problem-solving, analytical thinking, attention to detail