
Senior Platform Engineer, Reliability
Vizcom
Employment Type: Full-time
Location Type: Hybrid
Location: San Francisco • California • United States
About the role
- Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
- Set and enforce SLIs/SLOs/error budgets for critical user flows.
- Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
- Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
- Own poison-pill containment and workload isolation.
- Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
- Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
- Gate risky deploys and enforce reliability guardrails when production health is at risk.
- Establish baseline reliability metrics and identify top platform risks.
- Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
- Deliver high-impact hardening fixes across probes/startup paths/queue safety.
- Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.
Requirements
- Calm, structured incident commander under pressure.
- Thinks in failure modes and blast radius by default.
- Pragmatic: can stabilize quickly, then implement durable fixes.
- High ownership and strong written communication.
Benefits
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development