Lead SRE – Observability

athenahealth

Lead Site Reliability Engineer role enhancing the observability and telemetry platform for athenahealth's cloud infrastructure, collaborating with engineering teams to improve reliability and operational efficiency.

Posted 6/10/2026 • Full-time • Remote • Massachusetts • 🇺🇸 United States • Senior • 💰 $143,000 - $243,000 per year

Tech Stack

Tools & technologies
AWS, Cloud, Distributed Systems, Elasticsearch, Go, Grafana, Kafka, Linux, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments
  • Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight
  • Partner with engineering teams to strengthen telemetry collection and overall observability
  • Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency
  • Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency
  • Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability
  • Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines
  • Participate in incident response and on-call processes
  • Help drive operational excellence, root cause analysis, and continuous improvement
  • Mentor engineers on SRE best practices, observability strategy, and scalable systems design
  • Contribute to long-term platform strategy and reliability improvements

Requirements

What you’ll need
  • 7+ years of experience operating and engineering large-scale production infrastructure and distributed systems
  • Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices
  • Proven experience designing and operating observability and telemetry platforms
  • Hands-on experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar
  • Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling
  • Strong automation and software engineering skills using Python, Golang, or Bash
  • Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency
  • Experience operating services in cloud-native environments, including AWS and containerized platforms
  • Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence
  • Ability to communicate effectively across engineering organizations and influence technical decision-making

Benefits

Comp & perks
  • Health and financial benefits
  • Tuition assistance
  • Employee resource groups
  • Collaborative workspaces
  • Flexible work-life balance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux systems engineering, cloud infrastructure, SRE practices, observability platforms, telemetry platforms, Infrastructure as Code, automation solutions, Python, Golang, Bash
Soft Skills
communication, mentoring, technical decision-making, leadership, collaboration, problem-solving, incident response, root cause analysis, operational excellence, continuous improvement