Senior Site Reliability Engineer

Fieldguide

The Senior Site Reliability Engineer at Fieldguide is responsible for the reliability, scalability, and observability of production systems, and collaborates with engineering teams to implement reliability standards and improve system performance.

Posted 4/30/2026 • Full-time • Remote • California • 🇺🇸 United States • Senior • 💰 $190,000 - $206,000 per year

Tech Stack

Tools & technologies

AWS, Cloud, Distributed Systems, Grafana, Prometheus, Terraform

About the role

Key responsibilities & impact

Design and operate highly scalable, fault-tolerant systems that support production workloads across a distributed cloud environment.
Define and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to guide reliability decisions.
Build and improve observability systems (metrics, logs, tracing) to provide deep visibility into system behavior and performance.
Lead efforts to improve system reliability and performance, including capacity planning, load testing, and performance tuning.
Automate operational processes to reduce manual toil and improve system consistency and resilience.
Partner with engineering teams to design systems with reliability and scalability built in from the start.
Participate in and improve incident response, on-call practices, and post-incident reviews, focusing on root cause analysis and systemic improvements.
Drive continuous improvement of system resilience, including disaster recovery and chaos testing.
Establish best practices for monitoring, alerting, and incident management to ensure rapid detection and resolution of issues.
Advocate for a reliability-focused engineering culture, including blameless postmortems and operational excellence.

Requirements

What you’ll need

5+ years of experience in site reliability engineering, infrastructure, or a related software engineering discipline
Strong experience operating and scaling distributed systems in cloud environments, with AWS preferred
Hands-on experience building and managing observability platforms (e.g., Datadog, Prometheus, Grafana, CloudWatch)
Experience defining SLOs/SLIs and leveraging them to inform and drive engineering priorities
Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent
Deep understanding of system performance, reliability patterns, and distributed system failure modes
Experience supporting production systems through on-call rotations and incident response
Proficiency in at least one programming or scripting language used for automation and tooling
Strong communication and collaboration skills, with the ability to work effectively across engineering and product teams

Benefits

Comp & perks

Competitive compensation packages with meaningful ownership
Flexible PTO
401k
Wellness benefits, including a bundle of free therapy sessions
Technology & Work from Home reimbursement
Flexible work schedules

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

site reliability engineering, distributed systems, cloud environments, observability platforms, SLOs, SLIs, Infrastructure as Code, Terraform, system performance, automation

Soft Skills

communication, collaboration, leadership, incident response, root cause analysis, continuous improvement, operational excellence, capacity planning, performance tuning, blameless postmortems