Senior Site Reliability Engineer

CentralReach

full-time

Posted on: 8/28/2025

Location: Florida • 🇺🇸 United States

💰 $140,000 - $180,000 per year

Senior

Ansible, AWS, Chef, Cloud, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python, Splunk, Terraform

About the role

Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning
Set and maintain SLOs, SLIs, and error budgets, and create dashboards
Analyze, troubleshoot, and resolve operational challenges that affect defined SLOs
Manage site stability, performance, reliability, and maintain uptime for production environments
Develop a fully automated multi-environment observability stack and extend it to predict capacity needs
Automate to reduce toil and increase development velocity
Provide application-specific production support, incident management, change management, problem management, RCAs, and service restoration
Identify architecture changes for reliability, performance, and availability using a data-driven approach
Document run books and standard operating procedures
Collaborate with software development teams on release management and operational readiness
Implement reliability and observability tools (New Relic, Prometheus, Grafana, etc.)

Requirements

Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
Strong experience with AWS and Infrastructure as Code (Terraform, CloudFormation)
Understanding of High Availability best practices in AWS
Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic)
Experience with Prometheus and Grafana; implementing observability plans around logs, metrics, and traces
Extensive experience with Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef
Experience with release automation, system administration, and configuration management
Programming experience in Java, Python, or Go (or a similar language)
Scripting experience with Bash and PowerShell
Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
Experience with SLOs, SLIs, Error Budgets, dashboards, incident management, RCA, and change management