Site Reliability Engineer – SRE

Monstro

Site Reliability Engineer managing reliability and observability of a secure, multi-tenant platform on Google Cloud. Hands-on role focusing on incident response and reliability engineering.

Posted 6/8/2026 • Full-time • New York City, New York, 🇺🇸 United States • Mid-Level, Senior • $142,000 - $214,700 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

GCP, AWS, Azure, Kubernetes, API gateways, identity systems, IaC tools, Python, Go, Bash

Soft Skills

on-call experience, incident management, written communication, postmortem writing, action item tracking, problem-solving, reliability prioritization, automation mindset, dashboard discipline, alert hygiene

Tools & Technologies

Google Cloud Monitoring, BigQuery, structured logging, distributed tracing, runbooks, incident bridge, alert routing, error budgets, production alerts, audit evidence

Industry Keywords

SLOs, SLIs, observability, incident communication, toil reduction, service reliability, actionable alerts, clean handoffs, severity triage, mitigation strategies

Tech Stack

Tools & technologies

AWS, Azure, BigQuery, Cloud, Go, Google Cloud Platform, Kubernetes, Python

About the role

Key responsibilities & impact

Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
Tune alert routing so every page is actionable — kill the rest
Instrument services for distributed tracing and structured logging; push back on services that ship without it
Own error budgets and, when a budget is burned, use it to prioritize reliability work over feature work
Reduce toil: automate the top recurring page from the previous quarter
Maintain runbooks so every page maps to one within a cycle of first occurrence
Serve as first responder for production alerts across monitoring, API gateway, edge defense, and CI
Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
Own internal and external incident comms during your shift
Drive postmortems to closure with action items tracked as audit evidence
Write clean handoffs at the end of each shift

Requirements

What you’ll need

Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
Scripting / coding fluency (Python, Go, Bash) for automation and tooling
Good written communication — handoffs, postmortems, and runbooks are part of the job
Bias toward fixing the system, not the symptoms

Benefits

Comp & perks

Competitive salary
Equity
Paid health, vision, dental, and disability coverage