Lead SRE – Observability
athenahealth
Lead Site Reliability Engineer enhancing the observability and telemetry platform for athenahealth's cloud infrastructure. Collaborating with engineering teams to improve reliability and operational efficiency.
Posted 6/10/2026 • Full-time • Remote • Massachusetts • 🇺🇸 United States • Senior • 💰 $143,000 - $243,000 per year
Tech Stack
Tools & technologies: AWS, Cloud, Distributed Systems, Elasticsearch, Go, Grafana, Kafka, Linux, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments
- Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight
- Partner with engineering teams to strengthen telemetry collection and overall observability
- Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency
- Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency
- Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability
- Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines
- Participate in incident response and on-call processes
- Help drive operational excellence, root cause analysis, and continuous improvement
- Mentor engineers on SRE best practices, observability strategy, and scalable systems design
- Contribute to long-term platform strategy and reliability improvements
Requirements
What you’ll need
- 7+ years of experience operating and engineering large-scale production infrastructure and distributed systems
- Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices
- Proven experience designing and operating observability and telemetry platforms
- Hands-on experience with tools such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar
- Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling
- Strong automation and software engineering skills using Python, Golang, or Bash
- Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency
- Experience operating services in cloud-native environments, including AWS and containerized platforms
- Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence
- Ability to communicate effectively across engineering organizations and influence technical decision-making
Benefits
Comp & perks
- Health and financial benefits
- Tuition assistance
- Employee resource groups
- Collaborative workspaces
- Flexible work-life balance
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux systems engineering, cloud infrastructure, SRE practices, observability platforms, telemetry platforms, Infrastructure as Code, automation solutions, Python, Golang, Bash
Soft Skills
communication, mentoring, technical decision-making, leadership, collaboration, problem-solving, incident response, root cause analysis, operational excellence, continuous improvement