
Engineering Manager – SRE
AuthZed
full-time
Location Type: Remote
Location: United States
About the role
- Lead a global team of Site Reliability Engineers delivering infrastructure automation, observability, and operational scalability across multi-cloud and multi-region Kubernetes-based architectures.
- Recruit, hire, onboard and develop engineers while elevating the overall strength of the team.
- Act as a player-coach by contributing to critical projects while mentoring engineers and supporting their professional growth.
- Participate in on-call rotations at a sustainable level to stay grounded in real operational issues.
- Guide project planning by defining milestones, identifying dependencies, and working toward timely and meaningful delivery.
- Identify toil and lead initiatives to eliminate it through engineering solutions.
- Drive automation and platform engineering: safer deploys, progressive delivery, guardrails, and paved paths that reduce toil.
- Collaborate with product and engineering to ship features such as self-service workflows and infrastructure-as-code expectations, with reliability baked in.
- Serve as a senior escalation point for complex incident triage and root cause analysis.
Requirements
- 10+ years of experience in infrastructure, SRE, or platform engineering roles.
- 5+ years of team management or technical leadership in SRE or Platform Engineering.
- Experience managing distributed teams across US, Canada, EU, and global time zones.
- Experience leading or mentoring SRE/Infrastructure/Platform teams in a production SaaS environment.
- Strong grasp of SRE fundamentals: SLOs/SLIs, error budgets, incident management, capacity planning, and operational excellence.
- Extensive experience with AWS, GCP and Azure managed services.
- Strong programming skills and experience writing production-quality automation or tooling (e.g., Go, Python, Bash).
- Hands-on experience with Kubernetes, Kubernetes Operators/Controllers, containerized workloads, and Infrastructure as Code (Terraform, Pulumi).
- Experience with monitoring and observability systems (e.g., Prometheus, Grafana, logging/tracing pipelines).
- Excellent communication: can translate reliability tradeoffs to product/exec stakeholders and write crisp incident/postmortem artifacts.
- Proven ability to translate operational pain points into engineering deliverables.
Benefits
- Comprehensive benefits including healthcare (US-based) and other insurance.
- Stock options at an early-stage startup.
- Competitive salary based on experience.
- Fully remote work with a flexible schedule to accommodate different time zones.
- Twice-yearly travel for team offsites focused on team bonding, collaboration, and having fun!
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
infrastructure engineering, site reliability engineering, platform engineering, SLOs, SLIs, error budgets, incident management, capacity planning, automation, programming
Soft Skills
team management, technical leadership, mentoring, communication, collaboration, project planning, problem-solving, operational excellence, professional growth, incident triage