Leads post-incident investigations for the Site Reliability team.
Conducts in-depth post-incident analyses to identify root causes and develops preventive strategies.
Drafts clear and insightful RCAs for customer delivery.
Cross-trains colleagues on how best to leverage observability tools during incident and performance investigations.
Provides visibility to all stakeholders throughout the entire Site Reliability process.
Collaborates with cross-functional teams to implement system enhancements that improve scalability and stability.
Develops client-focused dashboards/alerts to proactively identify performance challenges.
Monitors and continuously improves our time-to-resolution metrics.
Maintains and configures core observability tools to ensure optimum performance and that key metrics and data are available for incident response and performance investigations.
Provides an actionable feedback loop to Observability and Engineering teams toward improving MELT (metrics, events, logs, traces) and development patterns.
Contributes to the development of automation tools to streamline incident response.
Works proactively to prevent incidents and reduce their impact on our platform.
Partners with the larger Cloud Operations, SRE, and Engineering teams, and the business at large, to advance our SaaS platforms.
Participates in on-call rotation with other team members as needed.
Other duties as assigned.

Requirements

Bachelor's degree in Computer Science or related field (or equivalent experience)
5+ years of proven experience in a Site Reliability Engineering role.
Strong knowledge of SRE best practices and incident management protocols
Deep experience using and/or configuring New Relic, Datadog, Sumo Logic, or similar observability tools
Proficiency in reading and writing code (e.g., JavaScript, .NET, SQL)
Familiarity with cloud platforms (e.g., AWS, Azure) and architectural patterns
Excellent problem-solving skills and a data-driven approach to incident analysis
Prior experience operating within a Public Cloud environment (AWS strongly preferred)
Experience troubleshooting C#/.NET-based web applications to identify bugs and performance challenges
Solid knowledge of SaaS operations
Ability to succeed when facing ambiguity and differing levels of operational maturity
Advanced written and verbal communication skills
Windows and SQL Server troubleshooting skills preferred
Knowledge of Continuous Integration and Continuous Delivery (CI/CD) pipelines preferred
Experience working in an Infrastructure as Code (IaC) environment preferred
Previous experience as a Software Engineer and/or System Administrator is a plus

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

JavaScript, .NET, SQL, New Relic, Datadog, Sumo Logic, AWS, Azure, CI/CD, Infrastructure as Code (IaC)

Soft Skills

problem-solving, data-driven approach, communication, collaboration, adaptability, cross-training, stakeholder visibility, feedback provision, proactive incident prevention, ambiguity management