Empower

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: United States

Salary

💰 $105,700 - $149,275 per year

About the role

  • Design and implement highly available, fault-tolerant systems supporting critical financial transactions.
  • Architect infrastructure solutions using AWS best practices, optimizing for cost, performance, and reliability.
  • Lead complex incident response efforts, coordinating across teams to restore service rapidly.
  • Drive postmortem processes for high-severity incidents, ensuring action items are identified and completed.
  • Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services.
  • Design and implement disaster recovery strategies and business continuity plans.
  • Build advanced Infrastructure as Code solutions using Terraform, including modules, workspaces, and state management.
  • Architect and optimize multi-cluster EKS environments, including pod autoscaling, cluster autoscaling, and resource optimization.
  • Design observability strategies using Datadog and Splunk, including metrics, dashboards, and alerting that support proactive detection.
  • Implement progressive delivery mechanisms (canary and blue-green deployments) within GitOps workflows.
  • Build automation frameworks that reduce operational toil and improve team efficiency.
  • Partner with development teams to improve application reliability, including design reviews and architectural guidance.
  • Mentor junior and intermediate SREs through coaching and code reviews.
  • Contribute to architectural decisions that impact platform reliability and scalability.
  • Evangelize SRE best practices across the engineering organization.
  • Participate in on-call rotations and drive improvements to reduce on-call burden.
  • Implement and maintain zero-trust security controls across infrastructure.
  • Ensure systems meet financial services regulatory requirements and internal compliance standards.
  • Conduct security reviews of infrastructure changes and deployment processes.
  • Participate in audit preparations and respond to compliance-related inquiries.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, or similar emphasis, or equivalent experience.
  • 4 to 7 years of Site Reliability Engineering experience (or equivalent), with a track record of operating large-scale production systems.
  • Deep, hands-on expertise in AWS across a broad range of services and architectural patterns.
  • Advanced Kubernetes knowledge, including custom resources, operators, and cluster federation concepts.
  • Expert proficiency in Terraform, including module development, state management, and complex workflow orchestration.
  • Strong programming skills in Python and/or Go, with the ability to develop production-quality tools and services.
  • Production experience implementing observability at scale using Datadog, Splunk, or similar platforms.
  • Demonstrated experience establishing and maintaining CI/CD pipelines at enterprise scale.
  • Deep understanding of GitOps principles and experience with tools such as ArgoCD or Flux.
  • Proven ability to lead complex incident response and conduct thorough postmortems.
  • Strong understanding of networking, security, and infrastructure design patterns.
  • Experience mentoring engineers and conducting technical training.

Benefits

  • Medical, dental, vision and life insurance
  • Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
  • Tuition reimbursement up to $5,250/year
  • Business-casual environment that includes the option to wear jeans
  • Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
  • Paid volunteer time – 16 hours per calendar year
  • Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
  • Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AWS, Kubernetes, Terraform, Python, Go, Datadog, Splunk, CI/CD, GitOps, zero-trust security

Soft Skills

leadership, mentoring, incident response, communication, collaboration, coaching, problem-solving, organizational skills, proactive detection, postmortem analysis

Certifications

Bachelor’s degree in Computer Science, Bachelor’s degree in Information Systems