Senior Site Reliability Engineer

Sanity.io

SRE managing scalable content operations infrastructure for an AI-powered platform, collaborating with development teams and ensuring reliability for systems with high request volume.

Posted 7/2/2026 • full-time • Remote • Connecticut, Massachusetts, New Jersey, New York, Pennsylvania, Rhode Island, Vermont • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies

Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Prometheus

About the role

Key responsibilities & impact

Design, build, and operate the shared platform foundations engineers ship on every day: GCP infrastructure, Kubernetes, networking, routing, CI/CD, and observability.
Diagnose and troubleshoot complex distributed systems running at high request volume.
Ensure observability and analyze the behavior of our stack.
Contribute to in-flight work like modernizing our edge, caching, and gateway layers onto Fastly and tightening observability across the platform.
Raise the reliability bar through better dashboards, alert severity, paging standards, on-call readiness, and incident response.
Make deployment boring in the best way: build golden paths, production readiness checks, safe rollouts, and useful automation so engineers have fewer places to look before they ship.
Mentor engineers and raise the technical bar through code review, design review, and pairing.
Participate in our on-call rotation and help our developer on-call rollout land well.

Requirements

What you’ll need

Based in the United States, with reasonable overlap with European engineering hours.
Experience with SRE/DevOps tools, processes, and culture.
5+ years of experience as part of an SRE on-call rotation.
Analytical approach to designing, diagnosing, and optimizing infrastructure.
Experience with managing scalable, highly available, cloud-based applications, ideally with high request volume and customer-facing uptime expectations.
Experience with Kubernetes for orchestrating, scaling, and managing containerized applications in cloud-based environments.
Experience building CI/CD pipelines.
Experience with an observability stack (Prometheus, et al.).
Comfortable working across CDNs, edge, gateways, and caching layers, or eager to go deep there.
You improve on-call and reliability by building systems, standards, and feedback loops that make production healthier over time.
You are comfortable dealing with incidents and outages and have built a practical, thoughtful communication style for handling high-pressure situations.
An open but considered approach to new technologies.

Benefits

Comp & perks

A highly skilled, inspiring, and supportive team
Real infrastructure scale and meaningful, hands-on work changing how it runs
Positive, flexible, and trust-based work environment that encourages long-term professional and personal growth
A global, multi-culturally diverse group of colleagues and customers
Comprehensive health plans and perks
A healthy work-life balance that accommodates individual and family needs
Competitive stock options program and location-based salary

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Cloud-Based Application Management, Distributed Systems Diagnosis, Production Readiness Checks, Incident Response, Automation for Deployment

Soft Skills

Analytical Problem Solving, Effective Communication in High-Pressure Situations, Mentoring and Code Review