Exegy

Site Reliability Engineer – Data Center & Infrastructure

full-time

Location Type: Hybrid

Location: St. Louis, Missouri, United States

About the role

  • Maintain and improve uptime across core systems including compute, storage, virtualization, load balancers, and data center network infrastructure
  • Support production services across on-prem data centers, co-locations, and hybrid cloud environments
  • Participate in 24×7 on-call rotation, major incident response, and post-mortems
  • Lead root cause analysis (RCA) and drive long-term remediation plans
  • Identify system failure patterns and implement hardening strategies
  • Develop and maintain automation using Ansible, Terraform, PowerShell, Python, Puppet, or similar tools
  • Automate operational workflows, configuration management, deployments, and failover testing
  • Implement and improve Infrastructure-as-Code (IaC) for consistency and reduced drift
  • Build and enhance monitoring across systems, networks, and applications (Prometheus, Grafana, Datadog, New Relic, SolarWinds, Splunk, etc.)
  • Improve alert fidelity, create health dashboards, and expand log aggregation
  • Conduct proactive performance tuning across hardware, virtualization, and OS layers (Windows/Linux)
  • Support physical and virtual data center infrastructure including racking/stacking, cabling, hardware lifecycle, and capacity planning
  • Own patching, firmware upgrades, refresh cycles, and vendor coordination
  • Support DR/BCP testing, multi-site failover architecture, and replication strategies
  • Maintain secure baseline configurations aligned to CIS Benchmarks, NIST, and ISO standards
  • Partner closely with Network, Security, DevOps, and Application Engineering teams to improve reliability end-to-end
  • Influence architecture decisions regarding capacity, resiliency, and scalability
  • Create and maintain runbooks, playbooks, standards, and operational documentation
  • Implement and maintain security controls including MFA, encryption, logging, PAM, and patch compliance
  • Support audit requirements for SOC 2, ISO 27001, CIS Controls, and internal governance obligations
  • Participate in vulnerability remediation efforts and system hardening

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience
  • 5+ years in Site Reliability Engineering, Systems Engineering, or Infrastructure Operations
  • Hands-on experience with VMware, Hyper-V, or similar virtualization technologies
  • Strong Linux and Windows server administration background
  • Experience with on-prem data centers, hardware lifecycle, and networking
  • Proficiency in automation and scripting (PowerShell, Bash, Python, Ansible, Terraform)
  • Experience with monitoring, logging, and observability platforms
  • Familiarity with AWS, Azure, or GCP in hybrid environments
  • Ability to participate in on-call rotation and support critical incidents

Applicant Tracking System Keywords

Hard skills
Ansible, Terraform, PowerShell, Python, Puppet, VMware, Hyper-V, Linux server administration, Windows server administration, Infrastructure-as-Code (IaC)
Soft skills
leadership, problem-solving, communication, collaboration, incident response, root cause analysis, performance tuning, documentation, capacity planning, vendor coordination
Certifications
Bachelor’s degree in Computer Science, ISO 27001, SOC 2, CIS Controls