athenahealth

Lead Site Reliability Engineer

full-time

Location Type: Remote

Location: Massachusetts, United States

Salary

💰 $119,000 - $203,000 per year

About the role

  • Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
  • Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
  • Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
  • Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
  • Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
  • Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
  • Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
  • Lead the incident response efforts for cloud infrastructure-related issues, ensuring that all incidents are managed effectively.

Requirements

  • 10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet)
  • 7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
  • Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes)
  • Proficiency in scripting or programming languages (Python, Go, Bash, etc.)
  • Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack)
  • Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
  • Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
  • Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker)
  • Strong understanding of cloud networking, storage, and compute services, on-premises infrastructure, and security best practices
  • Strong knowledge of Linux administration and internals
  • Effective communication skills, with the ability to translate technical concepts to non-technical stakeholders.

Benefits

  • Health insurance
  • 401(k) matching
  • Flexible work hours
  • Paid time off
  • Remote work options
  • Employee assistance programs
  • Tuition assistance
  • Collaborative workspaces
Applicant Tracking System Keywords

Hard Skills & Tools
cloud automation, configuration management, Site Reliability Engineering, Infrastructure Engineering, DevOps, cloud services, scripting, cloud-native architectures, Linux administration, cloud networking
Soft Skills
technical leadership, effective communication