Senior Site Reliability Engineer

Adobe

Site Reliability Engineer for Adobe's Project Graph, ensuring the stability of its HTTP APIs and async compute platform and collaborating with backend engineers to enhance system performance and reliability.

Posted 6/19/2026 | Full-time | San Jose • California, Washington • 🇺🇸 United States | Senior | 💰 $159,200 - $301,600 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

site reliability engineering, infrastructure, backend software development, Kubernetes, Docker, bash, Node.js, TypeScript, Postgres, Terraform

Soft Skills

incident response, blameless postmortems, problem-solving, process improvement, communication, collaboration, learning agility

Tools & Technologies

CI/CD, CircleCI, AWS, observability tooling, metrics, logging, distributed tracing, alerting

Certifications & Qualifications

Bachelor's degree in Computer Science

Industry Keywords

SLOs, SLIs, error budgets, async compute platform, database backup, disaster recovery, RPO, RTO, HTTP API security

Tech Stack

Tools & technologies

AWS, Cloud, Docker, JavaScript, Kubernetes, Node.js, Postgres, Redis, Terraform, TypeScript

About the role

Key responsibilities & impact

Define and enforce SLOs, SLIs, and error budgets for Project Graph's HTTP APIs and async compute platform.
Build and maintain observability—metrics, logging, tracing, and alerting—so issues are caught and diagnosed quickly.
Lead incident response, run blameless postmortems, and drive the follow-up work that prevents recurrence.
Improve the reliability and scalability of an async job scheduling system built on top of Kubernetes and Postgres.
Maintain and improve CI/CD systems to keep delivery fast, safe, and reliable.
Own database data protection, backup, and resilience—including backup strategy, recovery testing, and disaster recovery planning.
Design and implement cloud infrastructure and automation to meet reliability, performance, and cost goals.
Reduce operational toil through tooling and automation, and partner with developers to build reliability in from the start.
Participate in an on-call rotation.

Requirements

What you’ll need

Bachelor's degree or equivalent experience in Computer Science.
5-10 years of experience in site reliability engineering, infrastructure, or backend software development with a strong operational focus.
Expertise with Kubernetes in production, including scaling, troubleshooting, and tuning.
Expertise with Docker and containerization.
Strong experience with bash and CI/CD tools such as CircleCI.
Strong hands-on experience in at least one server-side language; we use Node.js/TypeScript.
Experience operating data stores such as Postgres, Redis, or similar in production; we run on AWS Aurora (Postgres-compatible), so familiarity with managed/Aurora environments is a plus.
Experience with database backup, resilience, and disaster recovery—designing backup strategies, testing recovery, and meeting RPO/RTO targets.
Experience with Terraform and AWS.
Hands-on experience with observability tooling (metrics, logging, distributed tracing) and alerting.
Familiarity with HTTP API security.
A track record of incident response and a systematic, blameless approach to learning from failures.
An interest in and ability to learn new technologies.
Ability to tackle problems in a sustainable way, always striving to improve our processes and learn.
Excellent verbal and written communication skills; can effectively articulate complex ideas and influence others through well-reasoned explanations.

Benefits

Comp & perks

Health insurance
401(k) matching
Paid time off
Flexible work hours
Professional development opportunities