Staff Site Reliability Engineer

Fieldguide

Staff Site Reliability Engineer leading distributed systems design and evolution at Fieldguide. Overseeing reliability, scalability, and observability strategy as part of a remote-first company.

Posted 4/30/2026 • full-time • Remote • California • 🇺🇸 United States • Lead • 💰 $210,000 - $247,000 per year

Tech Stack

Tools & technologies

AWS • Cloud • Distributed Systems • Grafana • Prometheus • Terraform

About the role

Key responsibilities & impact

Lead the design and evolution of highly scalable, fault-tolerant distributed systems across our cloud infrastructure.
Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams.
Architect and continuously improve observability platforms (metrics, logging, tracing).
Own reliability strategy and roadmap, proactively identifying risks and driving long-term improvements.
Lead cross-team initiatives to improve system performance, scalability, and resilience.
Establish and enforce best practices for incident response, on-call, and operational excellence.
Drive root cause analysis and systemic improvements through blameless postmortems.
Champion automation and reduction of operational toil.
Guide capacity planning, load testing, and performance optimization efforts.
Design and validate disaster recovery, failover strategies, and resilience testing.
Mentor and coach engineers to elevate reliability engineering maturity.
Partner with Staff engineers across the organization to drive meaningful change.
Partner with leadership to align business goals with reliability investments.

Requirements

What you’ll need

10+ years of experience in software engineering, with a focus on distributed systems and production infrastructure.
Extensive experience operating and scaling distributed systems in cloud environments, with a strong preference for AWS.
Deep expertise in system reliability, scalability, and performance engineering at scale.
Demonstrated experience implementing SLO-driven engineering practices and reliability frameworks.
Strong background building and owning observability ecosystems (e.g., Datadog, Prometheus, Grafana).
Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
Proven experience leading incident management, postmortems, and production operations.
Strong software engineering fundamentals with the ability to contribute to and review complex codebases.
Track record of technical leadership and cross-functional influence across engineering and product teams.
Ability to balance tactical short-term needs with strategic long-term architectural improvements.
Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences.

Benefits

Comp & perks

Competitive compensation packages with meaningful ownership
Flexible PTO
401k
Wellness benefits, including a bundle of free therapy sessions
Technology & Work from Home reimbursement
Flexible work schedules

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

distributed systems • cloud infrastructure • SLOs • SLIs • observability platforms • incident management • performance optimization • Infrastructure as Code • Terraform • reliability engineering

Soft Skills

leadership • mentoring • communication • cross-functional influence • strategic thinking • problem-solving • collaboration • risk management • coaching • tactical planning