
Staff Site Reliability Engineer
SmarterDx
full-time
Location Type: Remote
Location: United States
Salary
💰 $230,000 - $250,000 per year
About the role
- Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact.
- Implement a “reliability” platform using Terraform and infrastructure-as-code best practices.
- Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
- Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence.
- Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.
- Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability.
- Research, prototype, and advocate for new reliability practices, tooling, and architectural improvements across the engineering organization.
Requirements
- 10+ years of software and software reliability engineering experience, with significant time spent operating and scaling distributed systems in production environments.
- 3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems.
- Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements.
- Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly.
- Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes.
- Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field — or equivalent practical experience operating large-scale systems.
Benefits
- Medical, Dental & Vision – Comprehensive plans with leading insurance providers, covering 75% of your premiums, depending on the plan.
- Paid Parental Leave – Generous paid leave to support families through birth or adoption: up to 12 weeks for parents.
- Remote-First Team – Work from anywhere in the U.S.
- Unlimited PTO & 10 Holidays – So you can relax and recharge.
- 401(k) with Traditional & Roth Options – Tax-advantaged retirement savings through Fidelity with a 4% match.
- Minimal Bureaucracy – A fast-moving, high-impact environment where you can focus on what matters.
- Incredible Teammates! – Work alongside smart, supportive, and mission-driven colleagues.
Applicant Tracking System Keywords
Hard Skills & Tools
software reliability engineering, distributed systems, cloud-native infrastructure, AWS, containers, Kubernetes, Terraform, infrastructure-as-code, SLIs, SLOs
Soft Skills
incident response, blameless postmortems, systemic improvements, automation, self-healing systems, production support, research, advocacy
Certifications
Bachelor’s degree in Computer Science, Master’s degree in Engineering