Fieldguide

Staff Site Reliability Engineer

Fieldguide is hiring a Staff Site Reliability Engineer to lead the design and evolution of its distributed systems. The role oversees reliability, scalability, and observability strategy at a remote-first company.

Posted 4/30/2026 • full-time • Remote • California • 🇺🇸 United States • Lead • 💰 $210,000 - $247,000 per year

Tech Stack

Tools & technologies
  • AWS
  • Cloud
  • Distributed Systems
  • Grafana
  • Prometheus
  • Terraform

About the role

Key responsibilities & impact
  • Lead the design and evolution of highly scalable, fault-tolerant distributed systems across our cloud infrastructure.
  • Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams.
  • Architect and continuously improve observability platforms (metrics, logging, tracing).
  • Own reliability strategy and roadmap, proactively identifying risks and driving long-term improvements.
  • Lead cross-team initiatives to improve system performance, scalability, and resilience.
  • Establish and enforce best practices for incident response, on-call, and operational excellence.
  • Drive root cause analysis and systemic improvements through blameless postmortems.
  • Champion automation and reduction of operational toil.
  • Guide capacity planning, load testing, and performance optimization efforts.
  • Design and validate disaster recovery, failover strategies, and resilience testing.
  • Mentor and coach engineers to elevate reliability engineering maturity.
  • Partner with Staff engineers across the organization to drive meaningful change.
  • Partner with leadership to align business goals with reliability investments.

Requirements

What you’ll need
  • 10+ years of experience in software engineering, with a focus on distributed systems and production infrastructure.
  • Extensive experience operating and scaling distributed systems in cloud environments, with a strong preference for AWS.
  • Deep expertise in system reliability, scalability, and performance engineering at scale.
  • Demonstrated experience implementing SLO-driven engineering practices and reliability frameworks.
  • Strong background building and owning observability ecosystems (e.g., Datadog, Prometheus, Grafana).
  • Proficiency with Infrastructure as Code tooling, particularly Terraform or equivalent.
  • Proven experience leading incident management, post-mortems, and production operations.
  • Strong software engineering fundamentals with the ability to contribute to and review complex codebases.
  • Track record of technical leadership and cross-functional influence across engineering and product teams.
  • Ability to balance tactical short-term needs with strategic long-term architectural improvements.
  • Excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences.

Benefits

Comp & perks
  • Competitive compensation packages with meaningful ownership
  • Flexible PTO
  • 401k
  • Wellness benefits, including a bundle of free therapy sessions
  • Technology & Work from Home reimbursement
  • Flexible work schedules

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
  • distributed systems
  • cloud infrastructure
  • SLOs
  • SLIs
  • observability platforms
  • incident management
  • performance optimization
  • Infrastructure as Code
  • Terraform
  • reliability engineering
Soft Skills
  • leadership
  • mentoring
  • communication
  • cross-functional influence
  • strategic thinking
  • problem-solving
  • collaboration
  • risk management
  • coaching
  • tactical planning