
Senior Site Reliability Engineer
Empower
full-time
Location Type: Remote
Location: United States
Salary
💰 $105,700 - $149,275 per year
About the role
- Design and implement highly available, fault-tolerant systems supporting critical financial transactions.
- Architect infrastructure solutions using AWS best practices, optimizing for cost, performance, and reliability.
- Lead complex incident response efforts, coordinating across teams to restore service rapidly.
- Drive postmortem processes for high-severity incidents, ensuring action items are identified and completed.
- Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services.
- Design and implement disaster recovery strategies and business continuity plans.
- Build advanced Infrastructure as Code solutions using Terraform, including modules, workspaces, and state management.
- Architect and optimize multi-cluster EKS environments, including pod autoscaling, cluster autoscaling, and resource optimization.
- Design observability strategies using Datadog and Splunk, including metrics, dashboards, and alerting that support proactive detection.
- Implement progressive delivery mechanisms (canary and blue-green deployments) within GitOps workflows.
- Build automation frameworks that reduce operational toil and improve team efficiency.
- Partner with development teams to improve application reliability, including design reviews and architectural guidance.
- Mentor junior and intermediate SREs through coaching and code reviews.
- Contribute to architectural decisions that impact platform reliability and scalability.
- Evangelize SRE best practices across the engineering organization.
- Participate in on-call rotations and drive improvements to reduce on-call burden.
- Implement and maintain zero-trust security controls across infrastructure.
- Ensure systems meet financial services regulatory requirements and internal compliance standards.
- Conduct security reviews of infrastructure changes and deployment processes.
- Participate in audit preparations and respond to compliance-related inquiries.
Requirements
- Bachelor’s degree in Computer Science, Information Systems, or similar emphasis, or equivalent experience.
- 4 to 7 years of Site Reliability Engineering experience (or equivalent), with a track record operating large-scale production systems.
- Deep, hands-on expertise in AWS across a broad range of services and architectural patterns.
- Advanced Kubernetes knowledge, including custom resources, operators, and cluster federation concepts.
- Expert proficiency in Terraform, including module development, state management, and complex workflow orchestration.
- Strong programming skills in Python and/or Go, with ability to develop production-quality tools and services.
- Production experience implementing observability at scale using Datadog, Splunk, or similar platforms.
- Demonstrated experience establishing and maintaining CI/CD pipelines at enterprise scale.
- Deep understanding of GitOps principles and experience with tools such as ArgoCD or Flux.
- Proven ability to lead complex incident response and conduct thorough postmortems.
- Strong understanding of networking, security, and infrastructure design patterns.
- Experience mentoring engineers and conducting technical training.
Benefits
- Medical, dental, vision and life insurance
- Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
- Tuition reimbursement up to $5,250/year
- Business-casual environment that includes the option to wear jeans
- Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
- Paid volunteer time – 16 hours per calendar year
- Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
- Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, Kubernetes, Terraform, Python, Go, Datadog, Splunk, CI/CD, GitOps, zero-trust security
Soft Skills
leadership, mentoring, incident response, communication, collaboration, coaching, problem-solving, organizational skills, proactive detection, postmortem analysis
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Information Systems