Site Reliability Engineer – Data Center & Infrastructure

Exegy

Full-time

Posted on: 1/16/2026

Location Type: Hybrid

Location: St. Louis • Missouri • United States

Job Level

Mid-Level, Senior

Tech Stack

Ansible, AWS, Azure, Cloud, Google Cloud Platform, Grafana, Linux, Prometheus, Puppet, Python, Splunk, Terraform, VMware

About the role

Maintain and improve uptime across core systems including compute, storage, virtualization, load balancers, and data center network infrastructure
Support production services across on-prem data centers, co-locations, and hybrid cloud environments
Participate in 24×7 on-call rotation, major incident response, and post-mortems
Lead root cause analysis (RCA) and drive long-term remediation plans
Identify system failure patterns and implement hardening strategies
Develop and maintain automation using Ansible, Terraform, PowerShell, Python, Puppet, or similar tools
Automate operational workflows, configuration management, deployments, and failover testing
Implement and improve Infrastructure-as-Code (IaC) for consistency and reduced drift
Build and enhance monitoring across systems, networks, and applications (Prometheus, Grafana, Datadog, New Relic, SolarWinds, Splunk, etc.)
Improve alert fidelity, create health dashboards, and expand log aggregation
Conduct proactive performance tuning across hardware, virtualization, and OS layers (Windows/Linux)
Support physical and virtual data center infrastructure including racking/stacking, cabling, hardware lifecycle, and capacity planning
Own patching, firmware upgrades, refresh cycles, and vendor coordination
Support DR/BCP testing, multi-site failover architecture, and replication strategies
Maintain secure baseline configurations aligned to CIS Benchmarks, NIST, and ISO standards
Partner closely with Network, Security, DevOps, and Application Engineering teams to improve reliability end-to-end
Influence architecture decisions regarding capacity, resiliency, and scalability
Create and maintain runbooks, playbooks, standards, and operational documentation
Implement and maintain security controls including MFA, encryption, logging, PAM, and patch compliance
Support audit requirements for SOC 2, ISO 27001, CIS Controls, and internal governance obligations
Participate in vulnerability remediation efforts and system hardening

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent experience
5+ years in Site Reliability Engineering, Systems Engineering, or Infrastructure Operations
Hands-on experience with VMware, Hyper-V, or similar virtualization technologies
Strong Linux and Windows server administration background
Experience with on-prem data centers, hardware lifecycle, and networking
Proficiency in automation and scripting (PowerShell, Bash, Python, Ansible, Terraform)
Experience with monitoring, logging, and observability platforms
Familiarity with AWS, Azure, or GCP in hybrid environments
Ability to participate in an on-call rotation and support critical incidents

Benefits

24×7 on-call rotation
Major incident response
Post-mortems
Root cause analysis (RCA) and long-term remediation plans
Security controls including MFA, encryption, logging, PAM, and patch compliance
Audit requirements for SOC 2, ISO 27001, CIS Controls, and vendor coordination

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

Ansible, Terraform, PowerShell, Python, Puppet, VMware, Hyper-V, Linux server administration, Windows server administration, Infrastructure-as-Code (IaC)

Soft skills

leadership, problem-solving, communication, collaboration, incident response, root cause analysis, performance tuning, documentation, capacity planning, vendor coordination

Certifications

Bachelor’s degree in Computer Science, ISO 27001, SOC 2, CIS Controls