Senior Platform Engineer, Reliability

Vizcom

Full-time

Posted on: 2/24/2026

Location Type: Hybrid

Location: San Francisco • California • United States

About the role

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
Set and enforce SLIs/SLOs/error budgets for critical user flows.
Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
Own poison pill containment and workload isolation.
Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
Gate risky deploys and enforce reliability guardrails when production health is at risk.
Establish baseline reliability metrics and identify top platform risks.
Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
Deliver high-impact hardening fixes across probes/startup paths/queue safety.
Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

Benefits

Applicant Tracking System Keywords

Hard Skills & Tools

SLIs, SLOs, error budgets, failure isolation, probe contracts, rollback standards, scaling policies, incident response, observability, reliability metrics

Soft Skills

calm under pressure, structured incident commander, pragmatic, high ownership, strong written communication