SRE Observability SLO Engineer

GE Vernova

full-time

Posted on: 3/25/2026

Location Type: Hybrid

Location: Remote • United States

Job Level

Junior, Mid-Level

Tech Stack

AWS, Distributed Systems, Grafana, Kubernetes, Node.js, Prometheus, Python, Ray, Splunk

About the role

Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
Implement metrics collection for Kubernetes-hosted services (EKS/Rancher) including pod-level, namespace-level, and cluster-level metrics.
Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.
Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
Build and maintain SLO tooling — error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
Design and build operational dashboards covering availability, latency, error rates, and saturation (the 'Golden Signals') for every GridOS SaaS product.
Create executive-level dashboards for SRE leadership and customer-facing uptime/availability reports aligned to contractual SLAs.
Conduct periodic observability health reviews to identify gaps in coverage, reduce MTTD (Mean Time to Detect), and improve MTTR (Mean Time to Resolve).

Requirements

2–3 years in SRE, observability engineering, or infrastructure reliability roles.
Deep expertise with at least one major observability platform — Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
Experience with Kubernetes observability — kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.

Benefits

Relocation Assistance Provided

Applicant Tracking System Keywords

Hard Skills & Tools

observability engineering, SRE, Kubernetes, SLIs, SLOs, error budget alerting, distributed systems telemetry, structured logging, distributed tracing, scripting in Python

Soft Skills

collaboration, communication, problem-solving, analytical thinking, attention to detail