Optimal Ways

Staff Software Engineer, Site Reliability, SRE

full-time

Origin: 🇺🇸 United States

Salary

💰 $160,000 - $200,000 per year

Job Level

Lead

Tech Stack

AWS, Cloud, DynamoDB, Java, JavaScript, Kubernetes, Python, SDLC, Terraform, TypeScript

About the role

  • Reliability: Own the company-wide incident lifecycle, setting standards for detection, escalation, incident command, customer comms, and high-quality postmortems with action tracking.
  • Define and drive SLIs/SLOs for core services; build guardrails and dashboards that make reliability visible and actionable.
  • Lead production readiness reviews, capacity/performance planning, load testing, disaster recovery exercises, and resilience engineering (failure testing/chaos where appropriate).
  • Level up on-call: right-sized rotations, paging hygiene, runbooks, auto-remediation, and continuous improvement of MTTA/MTTR.
  • Security: Embed security into the delivery pipeline through dependency and image scanning, least-privilege IAM baselines, secrets management, and service-to-service auth.
  • Implement SOC 2-aligned controls as code, and build audit-friendly evidence generation into everyday engineering.
  • Drive secure-by-default patterns in the platform (network posture, data protection, runtime policies).
  • Platform & DevEx: Build and evolve paved roads for deploys, config, and runtime operations in our monorepo (Bazel) and CI/CD (AWS CodePipeline/CodeBuild).
  • Partner with product teams to make the secure default the easiest path—templates, tooling, libraries, and automation.
  • Improve observability end-to-end (traces, logs, metrics, alerts).

Requirements

  • Experienced: Staff-level IC who has led reliability programs at meaningful scale and owned incident response standards.
  • Technically Grounded: Deep, hands-on experience with infrastructure at scale, cloud, and containerization, including:
      • AWS (multi-service)
      • Containerized workloads on ECS and/or Kubernetes
      • CI/CD & IaC (Terraform)
      • Production networking fundamentals
  • Python Proficient: You can read/review service code and land operational improvements.
  • Data Driven: You take a data-driven approach to SLOs, capacity, performance, and cost efficiency, backed by strong observability skills.
  • Influential: Able to shape direction and create simple, durable standards
  • Communicative: Excels in both technical and interpersonal communication, with strong written and verbal skills
  • Nice To Have: FinOps, SOC 2, Data Science/ML collaboration, monorepo build tools (Bazel, Buck)