Plentiful.ai

Site Reliability Engineer

full-time

Location Type: Hybrid

Location: San Francisco, California, United States

About the role

  • Maintain and evolve alerting so engineers receive clear, actionable signals for anomalies, latency regressions and reliability risks
  • Define observability standards across metrics, logs and tracing with a focus on reliability, performance and customer impact instead of vanity data
  • Investigate performance bottlenecks across our distributed systems including serverless task execution, containerized services, workflow orchestration and Postgres
  • Lead incident response, coordinate root cause analysis and ensure reliability improvements are fully implemented and measured
  • Improve the reliability of our distributed task processing, including autoscaling behavior, execution patterns, retry logic, rate limiting and failure isolation
  • Support the stability of our serverless pipelines that process high-volume workloads across multiple execution layers
  • Partner with backend and ML teams on designing resilient mechanisms for scheduling, queueing and workflow execution
  • Maintain efficient and predictable resource usage across compute, networking and storage
  • Support security and compliance work including patching, audit readiness and vulnerability management
  • Participate in the on-call rotation and respond to production incidents quickly and calmly with a focus on restoring stable service and clear communication
  • Contribute to blameless postmortems, drive follow-through on fixes and ensure learnings are documented for future engineers

Requirements

  • 5+ years of professional engineering experience at a B2B SaaS company
  • Strong experience operating production systems in cloud environments, ideally AWS
  • Hands-on experience with serverless compute patterns, containerized services, distributed workflows and Postgres
  • Solid understanding of observability tooling, performance debugging and system behavior under load
  • A high-ownership mindset, empathy for teammates, straightforward communication and a one-team attitude
  • Comfortable working in a fast-paced startup environment with a bias for action and thoughtful engineering judgment

Benefits

  • Unlimited PTO
  • Fully covered health insurance (medical, dental, and vision)
  • Meal stipend
  • Health & wellness stipend
  • 401(k) matching
  • Stock options