Arcadia

Principal Site Reliability Engineer

Full-time

Location Type: Remote

Location: United States

Salary

💰 $180,000 - $230,000 per year

About the role

  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
  • Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

Requirements

  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
  • Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks

Benefits

  • Be part of a mission-driven company that is transforming the healthcare industry by changing the way patients receive care
  • A flexible, remote-friendly company with personality and heart
  • Employee-driven programs and initiatives for personal and professional development
  • Become a member of the talented, energized, diverse, and purpose-driven Arcadian Community

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SRE, platform engineering, systems engineering, Kubernetes, GitOps, Argo CD, Crossplane, Terraform, Python, AWS

Soft Skills
leadership, communication, mentoring, incident management, collaboration, influencing, problem-solving, documentation, capacity planning, performance improvement