
SRE Observability SLO Engineer
GE Vernova
Full-time
Location Type: Hybrid
Location: Remote • United States
About the role
- Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
- Implement metrics collection for Kubernetes-hosted services (EKS/Rancher) including pod-level, namespace-level, and cluster-level metrics.
- Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.
- Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
- Build and maintain SLO tooling — error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
- Design and build operational dashboards covering availability plus the 'Golden Signals' (latency, traffic, errors, and saturation) for every GridOS SaaS product.
- Create executive-level dashboards for SRE leadership and customer-facing uptime/availability reports aligned to contractual SLAs.
- Conduct periodic observability health reviews to identify gaps in coverage, reduce MTTD (Mean Time to Detect), and improve MTTR (Mean Time to Resolve).
Requirements
- 2–3 years of experience in SRE, observability engineering, or infrastructure reliability roles.
- Deep expertise with at least one major observability platform — Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
- Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
- Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
- Experience with Kubernetes observability — kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
- Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
- Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
- Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.
Benefits
- Relocation Assistance Provided
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
observability engineering, SRE, Kubernetes, SLIs, SLOs, error budget alerting, distributed systems telemetry, structured logging, distributed tracing, scripting in Python
Soft Skills
collaboration, communication, problem-solving, analytical thinking, attention to detail