CentralReach

Senior Site Reliability Engineer

full-time

Origin: 🇺🇸 United States • Florida

Salary

💰 $140,000 - $180,000 per year

Job Level

Senior

Tech Stack

Ansible, AWS, Chef, Cloud, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python, Splunk, Terraform

About the role

  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning
  • Set and maintain SLOs, SLIs, and error budgets, and create dashboards
  • Analyze, troubleshoot, and resolve operational issues that affect defined SLOs
  • Manage site stability, performance, reliability, and maintain uptime for production environments
  • Develop a fully automated multi-environment observability stack and extend it to predict capacity needs
  • Automate to reduce toil and increase development velocity
  • Provide application-specific production support, incident management, change management, problem management, RCAs, and service restoration
  • Identify architecture changes for reliability, performance, and availability using a data-driven approach
  • Document run books and standard operating procedures
  • Collaborate with software development teams on release management and operational readiness
  • Implement reliability and observability tools (New Relic, Prometheus, Grafana, etc.)

Requirements

  • Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
  • Strong experience with AWS and Infrastructure as Code (Terraform, CloudFormation)
  • Understanding of High Availability best practices in AWS
  • Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic)
  • Experience with Prometheus and Grafana; implementing observability plans around logs, metrics, and traces
  • Extensive experience with Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef
  • Experience with release automation, system administration, and configuration management
  • Programming experience in Java, Python, or Go (or similar)
  • Scripting experience with Bash and PowerShell
  • Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
  • Experience with SLOs, SLIs, Error Budgets, dashboards, incident management, RCA, and change management