
Senior Site Reliability Engineer
HavocAI
full-time
Location Type: Remote
Location: United States
Salary
💰 $150,000 - $185,000 per year
About the role
- Design and evolve reliability architecture for distributed and cloud-hosted systems.
- Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.
- Partner with platform and application teams to design systems for reliability, scalability, and operability.
- Identify and mitigate systemic reliability risks across infrastructure and services.
- Lead incident response processes including on-call rotations, escalation, and post-incident reviews.
- Conduct root cause analysis for complex production incidents and drive long-term improvements.
- Improve operational readiness through runbooks, automation, and resilience testing.
- Reduce operational toil through tooling, automation, and process improvements.
- Design and maintain observability systems for metrics, logging, tracing, and alerting.
- Ensure services and data pipelines are observable, debuggable, and performant in production.
- Drive performance analysis and tuning across infrastructure and service layers.
- Build automation to improve system reliability, deployment safety, and recovery processes.
- Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.
- Support and improve Kubernetes-based environments and containerized workloads.
- Collaborate with security teams to ensure secure and resilient system design.
- Participate in disaster recovery planning and testing.
- Maintain strong operational practices around access control, secrets management, and change management.
Requirements
- 7+ years of experience in SRE, infrastructure, or systems engineering roles
- Strong experience operating large-scale distributed production systems
- Deep understanding of Linux systems, networking, and distributed systems fundamentals
- Hands-on experience with Kubernetes and container orchestration
- Programming or scripting experience in Go, Python, or similar languages
- Experience designing and operating observability systems for production environments
- Proven ability to lead incident response and reliability improvements
- Strong communication skills and ability to collaborate across engineering teams
- Must be a US citizen
- Must be eligible to obtain a government clearance, if required
Benefits
- 100% employer-paid health, dental, and vision insurance for you and your family
- Life insurance (employer-paid)
- Ability to participate in the company's 401(k) program (with matching)
- Unlimited PTO policy with an enforced 2-week minimum
- Equity Package
- Work / Home Office Stipend
- Global Entry
- 16 weeks of paid parental leave
- Monthly Health and Wellness Stipend
Applicant Tracking System Keywords
Hard Skills & Tools
SRE best practices, SLIs, SLOs, error budgets, capacity planning, root cause analysis, automation, observability systems, Kubernetes, programming in Go
Soft Skills
lead incident response, strong communication, collaboration
Certifications
Government Clearance eligibility