
Lead Site Reliability Engineer
athenahealth
full-time
Location Type: Remote
Location: Massachusetts • United States
Salary
💰 $119,000 - $203,000 per year
About the role
- Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
- Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
- Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
- Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
- Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
- Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
- Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
- Lead incident response for cloud infrastructure issues, ensuring that all incidents are managed effectively.
Requirements
- 10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet)
- 7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
- Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes)
- Proficiency in scripting or programming languages (e.g., Python, Go, Bash)
- Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack)
- Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
- Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
- Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker)
- Strong understanding of cloud networking, storage, and compute services, on-premises infrastructure, and security best practices
- Strong knowledge of Linux administration and internals
- Effective communication skills, with the ability to translate technical concepts to non-technical stakeholders.
Benefits
- Health insurance
- 401(k) matching
- Flexible work hours
- Paid time off
- Remote work options
- Employee assistance programs
- Tuition assistance
- Collaborative workspaces
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
cloud automation, configuration management, Site Reliability Engineering, Infrastructure Engineering, DevOps, cloud services, scripting, cloud-native architectures, Linux administration, cloud networking
Soft Skills
technical leadership, effective communication