GE Vernova

SRE Platform Engineer

Full-time

Location Type: Hybrid

Location: United States

About the role

  • Provisioning, infrastructure hardening, and Kubernetes cluster orchestration: help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
  • Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches.
  • Implement "Policy as Code" guardrails and secure network perimeters (Electronic Security Perimeters, or ESPs) in alignment with NERC CIP and IEC 62443 standards.
  • Standardize the runbooks and operating processes required to run critical infrastructure with the highest reliability.
  • Define and enforce Kubernetes resource quotas, limit ranges, and Pod Priority classes to ensure mission-critical services receive prioritized compute resources.
  • Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
  • Lead platform-level smoke tests, load tests, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
  • Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).
  • Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
  • Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
  • Proactively identify and automate repetitive operational tasks, such as cluster upgrades and OS patching, so that the team spends at least 50% of its time on engineering improvements.
  • Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.

Requirements

  • 5 years of experience operating production-grade Kubernetes clusters at scale.
  • Expert-level knowledge of multi-cluster management and performance tuning, plus experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, and Datadog.
  • Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
  • Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation, and in deployment tools such as ArgoCD or Flux.
  • Strong understanding of, and hands-on experience with, cloud networking concepts such as VPCs, routing, and load balancing, and security configurations such as encryption and certificate management.
  • Bachelor's degree in Computer Science or another STEM major (Science, Technology, Engineering, and Math), with advanced experience.
  • 6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments.
  • Proven track record as a structured incident responder who can handle production-down and break-glass scenarios in mission-critical applications.
  • Practical knowledge of NERC CIP, SOC 2, ISO 27001, or IEC 62443 compliance standards in a SaaS context.
  • AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), or SRE Practitioner Certification.
  • Experience supporting mission-critical systems in energy, utilities, or other high-stakes industrial sectors.
  • Ability to work with global teams, both independently and as part of a team.

Benefits

  • Relocation Assistance Provided

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Kubernetes, Terraform, Ansible, Python, Go, AWS, cloud networking, observability tools, incident response, infrastructure automation
Soft Skills
leadership, problem-solving, communication, collaboration, independence, structured incident response, proactive identification, root cause analysis
Certifications
AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), SRE Practitioner Certification