Vizcom

Senior Platform Engineer, Reliability

Full-time

Location Type: Hybrid

Location: San Francisco, California, United States

About the role

  • Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
  • Set and enforce SLIs/SLOs/error budgets for critical user flows.
  • Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.
  • Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.
  • Own poison pill containment and workload isolation.
  • Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).
  • Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.
  • Gate risky deploys and enforce reliability guardrails when production health is at risk.
  • Establish baseline reliability metrics and identify top platform risks.
  • Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).
  • Deliver high-impact hardening fixes across probes/startup paths/queue safety.
  • Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

Requirements

  • Calm, structured incident commander under pressure.
  • Thinks in failure modes and blast radius by default.
  • Pragmatic: can stabilize quickly, then implement durable fixes.
  • High ownership and strong written communication.

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development

Applicant Tracking System Keywords

Hard Skills & Tools
SLIs, SLOs, error budgets, failure isolation, probe contracts, rollback standards, scaling policies, incident response, observability, reliability metrics
Soft Skills
calm under pressure, structured incident commander, pragmatic, high ownership, strong written communication