Monstro

Site Reliability Engineer – SRE

Site Reliability Engineer responsible for the reliability and observability of a secure, multi-tenant platform on Google Cloud. This is a hands-on role focused on incident response and reliability engineering.

Posted 6/9/2026 • full-time • New York City, New York, 🇺🇸 United States • Mid-Level, Senior • 💰 $142,000 - $214,700 per year

Tech Stack

Tools & technologies
AWS, Azure, BigQuery, Cloud, Go, Google Cloud Platform, Kubernetes, Python

About the role

Key responsibilities & impact
  • Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
  • Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
  • Tune alert routing so every page is actionable — kill the rest
  • Instrument services for distributed tracing and structured logging; push back on services that ship without it
  • Own error budgets and, when a budget is burned, use it to prioritize reliability work over feature work
  • Reduce toil: automate the top recurring page from the previous quarter
  • Maintain runbooks so every page maps to one within a cycle of first occurrence
  • Act as first responder for production alerts across monitoring, API gateway, edge defense, and CI
  • Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
  • Own internal and external incident comms during your shift
  • Drive postmortems to closure with action items tracked as audit evidence
  • Write clean handoffs at the end of each shift

Requirements

What you’ll need
  • Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
  • Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
  • Strong observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
  • Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
  • Scripting / coding fluency (Python, Go, Bash) for automation and tooling
  • Good written communication — handoffs, postmortems, and runbooks are part of the job
  • Bias toward fixing the system, not the symptoms

Benefits

Comp & perks
  • Competitive salary
  • Equity
  • Paid health, vision, dental, and disability coverage

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
GCP, AWS, Azure, Kubernetes, API gateways, identity systems, IaC tools, Python, Go, Bash
Soft Skills
on-call experience, incident management, written communication, postmortem writing, action item tracking, problem-solving, reliability prioritization, automation mindset, dashboard discipline, alert hygiene