Salary
💰 $112,700 - $160,000 per year
Tech Stack
AWS, Azure, Cloud, Cyber Security, SQL
About the role
- Lead, mentor, and work alongside a 24x7 site reliability team responsible for monitoring, incident response, and resolution of mission-critical systems
- Take hands-on ownership of incident management processes to ensure rapid detection, escalation, communication, and resolution in line with SLA targets
- Design, plan, and conduct disaster recovery (DR) drills to validate system resilience and recovery readiness
- Maintain and enforce SOC and ISO controls related to site reliability, security, and compliance
- Drive alignment and compliance with NIST cybersecurity and risk management frameworks in collaboration with security and audit teams
- Ensure the reliability and availability of complex high-availability (HA) systems architected on Azure and AWS
- Collaborate with development, infrastructure, and security teams to implement best practices for system reliability, automation, and scalability
- Drive continuous improvement of site reliability processes, automation, and tooling to enhance system performance and minimize downtime
- Manage capacity planning and resource allocation to sustain a resilient and responsive site reliability function
- Develop and maintain runbooks, documentation, and standards for incident response, recovery, and compliance
- Lead root cause analysis efforts and implement preventive measures to reduce recurrence of issues
Requirements
- Proven experience managing and working hands-on with SRE or 24x7 site reliability teams in a high-availability environment
- 8-15 years of relevant experience
- Bachelor of Science degree or equivalent work experience is highly preferred
- Expertise in incident management and ensuring system reliability for mission-critical applications
- Experience designing and executing disaster recovery drills
- Knowledge and practical experience maintaining SOC and ISO compliance controls
- Strong understanding of NIST frameworks and ability to drive organizational alignment
- Strong hands-on experience with Microsoft Azure or AWS cloud platforms is preferred
- Experience with New Relic or similar monitoring and observability tools is preferred
- Hands-on familiarity with IIS, Windows server environments, and SQL databases is a plus
- Proficiency with infrastructure automation, monitoring, alerting, and incident response tools
- Exceptional leadership, communication, and collaboration skills
- Demonstrated self-initiative, accountability, and a growth mindset
- Ability to thrive in a fast-paced, dynamic environment with multiple stakeholders
- Relevant cloud certifications (e.g., Azure Solutions Architect, AWS Certified SysOps Administrator) are advantageous