
Senior Manager – SRE
Diabetes Youth Families
full-time
Location Type: Hybrid
Location: Massachusetts • United States
Salary
$178,700 - $268,025 per year
About the role
- Lead the execution and continuous improvement of SRE practices across assigned platforms and services, reinforcing a culture of reliability, efficiency, and operational ownership
- Manage and evolve automation strategies that reduce operational toil, improve system reliability, and increase engineering productivity
- Design, implement, and operate observability, monitoring, and alerting solutions that provide actionable insight into system health, availability, and performance
- Own and lead high‑severity incident response for supported services, ensuring effective triage, coordination, root cause analysis, and completion of corrective and preventative actions
- Analyze reliability, performance, and capacity metrics to identify risks, drive proactive improvements, and support long‑term system resilience
- Partner with software engineering, product, and infrastructure teams to embed SRE principles throughout the development lifecycle and influence architecture and design decisions
- Build, coach, and develop SRE managers and engineers, fostering technical excellence, career growth, and strong on‑call and operational practices
- Support capacity planning, scalability assessments, and demand forecasting for critical systems and services
- Ensure SRE processes, standards, and best practices are well documented, understood, and consistently applied
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 12+ years of overall engineering experience, including 5+ years in Site Reliability Engineering, DevOps, or a similar role
- 3+ years of experience leading engineering teams or managing senior technical contributors
- Strong experience with observability and monitoring platforms such as Datadog, Prometheus, Dynatrace, Grafana, ELK, or similar
- Proficiency in at least one programming language such as Python, Go, or Java
- Hands‑on experience with cloud platforms (AWS, Azure, or GCP) and container and orchestration technologies (Docker, Kubernetes)
- Solid working knowledge of AWS services such as VPC, EC2, ELB, ECS, EKS, Lambda, IAM, CloudWatch, S3, SQS, SNS, Route53, and WAF
- Experience with infrastructure‑as‑code tools such as Terraform, Ansible, or equivalents
- Strong troubleshooting and problem‑solving skills in distributed systems environments
- Working knowledge of security best practices and operational risk management
- Experience with resilience testing, chaos engineering, or failure‑injection techniques
Benefits
- Medical, dental, and vision insurance
- 401(k) with company match
- Paid time off (PTO)
- Additional employee wellness programs
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, DevOps, observability, monitoring, programming language, Python, Go, Java, infrastructure-as-code, resilience testing
Soft Skills
leadership, coaching, problem-solving, communication, collaboration, analytical skills, operational ownership, technical excellence, capacity planning, incident response