Lead SRE – Observability

athenahealth

Lead Site Reliability Engineer enhancing the observability and telemetry platform for athenahealth's cloud infrastructure, collaborating with engineering teams to improve reliability and operational efficiency.

Posted 6/10/2026 • Full-time • Remote • Massachusetts • 🇺🇸 United States • Senior • 💰 $143,000 - $243,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Linux systems engineering, cloud infrastructure, SRE practices, observability platforms, telemetry platforms, Infrastructure as Code, automation solutions, Python, Golang, Bash

Soft Skills

communication, mentoring, technical decision-making, leadership, collaboration, problem-solving, incident response, root cause analysis, operational excellence, continuous improvement

Tools & Technologies

OpenSearch, Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, Terraform

Industry Keywords

scalable systems, production infrastructure, distributed systems, monitoring strategy, telemetry pipelines, cloud-native environments, AWS, automation, reliability, performance

Tech Stack

Tools & technologies

AWS, Cloud, Distributed Systems, Elasticsearch, Go, Grafana, Kafka, Linux, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments
Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight
Partner with engineering teams to strengthen telemetry collection and overall observability
Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency
Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency
Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability
Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines
Participate in incident response and on-call processes
Help drive operational excellence, root cause analysis, and continuous improvement
Mentor engineers on SRE best practices, observability strategy, and scalable systems design
Contribute to long-term platform strategy and reliability improvements

Requirements

What you’ll need

7+ years of experience operating and engineering large-scale production infrastructure and distributed systems
Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices
Proven experience designing and operating observability and telemetry platforms
Hands-on experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar
Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling
Strong automation and software engineering skills using Python, Golang, or Bash
Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency
Experience operating services in cloud-native environments, including AWS and containerized platforms
Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence
Ability to communicate effectively across engineering organizations and influence technical decision-making

Benefits

Comp & perks

Health and financial benefits
Tuition assistance
Employee resource groups
Collaborative workspaces
Flexible work-life balance