
Site Reliability Engineer – Data Center & Infrastructure
Exegy
Full-time
Location Type: Hybrid
Location: St. Louis • Missouri • United States
About the role
- Maintain and improve uptime across core systems including compute, storage, virtualization, load balancers, and data center network infrastructure
- Support production services across on-prem data centers, co-locations, and hybrid cloud environments
- Participate in a 24×7 on-call rotation, major incident response, and post-mortems
- Lead root cause analysis (RCA) and drive long-term remediation plans
- Identify system failure patterns and implement hardening strategies
- Develop and maintain automation using Ansible, Terraform, PowerShell, Python, Puppet, or similar tools
- Automate operational workflows, configuration management, deployments, and failover testing
- Implement and improve Infrastructure-as-Code (IaC) for consistency and reduced drift
- Build and enhance monitoring across systems, networks, and applications (Prometheus, Grafana, Datadog, New Relic, SolarWinds, Splunk, etc.)
- Improve alert fidelity, create health dashboards, and expand log aggregation
- Conduct proactive performance tuning across hardware, virtualization, and OS layers (Windows/Linux)
- Support physical and virtual data center infrastructure including racking/stacking, cabling, hardware lifecycle, and capacity planning
- Own patching, firmware upgrades, refresh cycles, and vendor coordination
- Support DR/BCP testing, multi-site failover architecture, and replication strategies
- Maintain secure baseline configurations aligned to CIS Benchmarks, NIST, and ISO standards
- Partner closely with Network, Security, DevOps, and Application Engineering teams to improve reliability end-to-end
- Influence architecture decisions regarding capacity, resiliency, and scalability
- Create and maintain runbooks, playbooks, standards, and operational documentation
- Implement and maintain security controls including MFA, encryption, logging, PAM, and patch compliance
- Support audit requirements for SOC 2, ISO 27001, CIS Controls, and internal governance obligations
- Participate in vulnerability remediation efforts and system hardening
Requirements
- Bachelor’s degree in Computer Science or Engineering, or equivalent experience
- 5+ years in Site Reliability Engineering, Systems Engineering, or Infrastructure Operations
- Hands-on experience with VMware, Hyper-V, or similar virtualization technologies
- Strong Linux and Windows server administration background
- Experience with on-prem data centers, hardware lifecycle, and networking
- Proficiency in automation and scripting (PowerShell, Bash, Python, Ansible, Terraform)
- Experience with monitoring, logging, and observability platforms
- Familiarity with AWS, Azure, or GCP in hybrid environments
- Ability to participate in an on-call rotation and support critical incidents
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Ansible • Terraform • PowerShell • Python • Puppet • VMware • Hyper-V • Linux server administration • Windows server administration • Infrastructure-as-Code (IaC)
Soft skills
leadership • problem-solving • communication • collaboration • incident response • root cause analysis • performance tuning • documentation • capacity planning • vendor coordination
Certifications
Bachelor’s degree in Computer Science • ISO 27001 • SOC 2 • CIS Controls