
SRE Platform Engineer
GE Vernova
full-time
Location Type: Hybrid
Location: United States
About the role
- Provisioning, infrastructure hardening, and Kubernetes cluster orchestration: help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
- Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches.
- Implement "Policy as Code" guardrails and secure network perimeters (Electronic Security Perimeters, ESPs) in alignment with NERC CIP and IEC 62443 standards.
- Standardize the runbooks and operating processes required to run critical infrastructure with the highest reliability.
- Define and enforce Kubernetes resource quotas, limit ranges, and Pod Priority classes to ensure mission-critical services receive prioritized compute resources.
- Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
- Lead platform-level smoke tests, load tests, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
- Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).
- Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
- Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
- Proactively identify and automate repetitive operational tasks, such as cluster upgrades and OS patching, so that the team spends at least 50% of its time on engineering improvements.
- Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.
Requirements
- 5 years of experience operating production-grade Kubernetes clusters at scale.
- Expert-level knowledge of multi-cluster management and performance tuning, plus experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, or Datadog.
- Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
- Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation, and in deployment tools such as ArgoCD or Flux.
- Strong understanding of, and hands-on experience with, cloud networking concepts such as VPCs, routing, and load balancing, and security configurations such as encryption and certificate management.
- Bachelor's degree in Computer Science or a STEM major (Science, Technology, Engineering, and Math) with advanced experience.
- 6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments.
- Proven track record as a structured incident responder who can handle production-down and break-glass scenarios in mission-critical applications.
- Practical knowledge of NERC CIP, SOC2, ISO 27001, or IEC 62443 compliance standards in a SaaS context.
- AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), or SRE Practitioner Certification.
- Experience supporting mission-critical systems in energy, utilities, or other high-stakes industrial sectors.
- Ability to work with global teams, both independently and as part of a team.
Benefits
- Relocation Assistance Provided
Applicant Tracking System Keywords
Hard Skills & Tools
Kubernetes, Terraform, Ansible, Python, Go, AWS, cloud networking, observability tools, incident response, infrastructure automation
Soft Skills
leadership, problem-solving, communication, collaboration, independence, structured incident response, proactive identification, root cause analysis
Certifications
AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), SRE Practitioner Certification