
Site Reliability Engineer
Empower
full-time
Location Type: Remote
Location: United States
Salary
💰 $87,400 - $123,400 per year
About the role
- Own and improve the reliability, stability, scalability, and performance of our core data platforms and services
- Provide operational support for large-scale, distributed data systems, ensuring high availability and strong SLAs
- Partner closely with full-stack, data, and platform engineering teams to deliver continuous improvements
- Operate and support EMR and EMR Serverless (Python/Spark) workloads and data pipelines
- Support and optimize Amazon Redshift and DynamoDB in high-throughput, production environments
- Design, build, and evolve monitoring, alerting, and observability frameworks with a focus on symptoms, not just outages
- Lead incident response, troubleshooting production issues across the full stack and coordinating with internal and external stakeholders
- Perform root cause analysis (RCA) and readiness reviews; turn findings into durable fixes and automation
- Create and maintain runbooks, SOPs, and operational documentation
- Collaborate with engineering teams to optimize performance, reliability, and cost
- Participate in an on-call rotation to respond to incidents impacting customer-facing systems
- Recommend and influence the use of AWS managed services and architectural patterns
- Continuously evaluate system performance, capacity, and cost to scale efficiently
Requirements
- 4–6 years of experience building or operating systems across multiple architecture domains: application, data, integration, infrastructure, and security
- 4+ years of hands-on AWS experience, with strong production exposure to several of the following: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3, Lambda, Step Functions, EventBridge, RDS, IAM
- Proven experience operating data platforms such as data lakes and data warehouses in production
- Strong SQL skills and experience working with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
- 4+ years of Python experience, including scripting, automation, or data workloads
- Experience with CloudWatch, infrastructure monitoring, and alerting
- Hands-on experience with incident management, uptime SLAs, and customer-impacting systems
- Strong understanding of Git-based workflows (GitHub, Git Flow, or similar)
- Experience working in Agile environments (Scrum / Kanban) using tools such as Jira and Confluence
- Bachelor’s degree in Computer Science, Information Systems, Data/Analytics, or a related field; equivalent practical experience welcomed
Benefits
- Medical, dental, vision and life insurance
- Retirement savings – 401(k) plan with generous company matching contributions (up to 6%)
- Tuition reimbursement up to $5,250/year
- Business-casual environment that includes the option to wear jeans
- Generous paid time off upon hire – including a paid time off program, ten paid company holidays, and three floating holidays each calendar year
- Paid volunteer time – 16 hours per calendar year
- Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
- Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, SQL, AWS, Redshift, DynamoDB, EMR, EMR Serverless, CloudWatch, Git, Agile
Soft Skills
incident management, troubleshooting, collaboration, communication, problem-solving, root cause analysis, documentation, leadership, organizational skills, customer focus