
Senior Site Reliability Engineer
Broadridge
full-time
Location Type: Hybrid
Location: Newark • California • New Jersey • United States
Salary
💰 $100,000 - $110,000 per year
About the role
- Design and implement high-availability, fault-tolerant architectures across on-prem and cloud platforms (AWS)
- Lead multi-region DR planning, implementation, and testing, including RTO/RPO definition and validation
- Define and enforce SLOs, SLIs, and error budgets to balance reliability with delivery velocity
- Drive self-healing automation and proactive remediation strategies
- Build and maintain infrastructure using Terraform and configuration management tools (e.g., Chef)
- Develop automation to eliminate manual operational tasks (toil reduction)
- Create reusable modules, pipelines, and guardrails for standardized deployments
- Automate certificate lifecycle management, key rotation, and security updates
- Design and implement end-to-end observability (metrics, logs, traces, synthetic monitoring)
- Build dashboards, alerts, and runbooks to enable fast detection and resolution of incidents
- Improve signal-to-noise ratio in alerting to reduce operational fatigue
- Perform root cause analysis (RCA) and lead post-incident reviews with actionable follow-ups
- Engineer and operate platforms on AWS, including services such as EKS, EC2, RDS/Aurora, Lambda, API Gateway, CloudFront, WAF, ALB/NLB, CloudWatch, X-Ray, IAM, and Secrets Manager
- Lead cloud migrations and modernization initiatives, including legacy system refactoring
- Implement secure networking patterns (VPCs, private subnets, controlled egress)
- Identify and resolve performance bottlenecks through testing and analysis
- Drive FinOps initiatives to optimize infrastructure cost without compromising reliability
- Implement capacity planning and autoscaling strategies
- Design and support CI/CD pipelines enabling safe, repeatable deployments
- Embed reliability practices into the SDLC (testing, rollout strategies, rollback)
- Partner with development teams to improve operability of applications before production
- Partner with security and legal teams to meet regulatory and compliance requirements (e.g., data residency, GDPR-related controls)
- Implement secure access controls, secrets management, and encryption best practices
- Participate in security reviews, audits, and risk assessments
- Act as a technical leader and mentor for engineers transitioning into SRE roles
- Influence architecture and design decisions across multiple teams
- Communicate effectively with engineering leadership, product owners, and non-technical stakeholders
- Drive a culture of operational excellence, blameless postmortems, and continuous improvement
Requirements
- 3+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Systems Engineering
- Strong programming experience in Python, Java, or similar languages
- Deep experience with Linux/Unix systems
- Hands-on expertise with AWS and cloud-native architectures
- Proven experience with Terraform and Infrastructure as Code
- Strong understanding of networking, security, and distributed systems
- Experience operating mission-critical, high-volume platforms
- Preferred: Experience in financial services or highly regulated environments
- Preferred: Experience with EKS/Kubernetes at scale
- Preferred: Familiarity with Chaos Engineering and resilience testing
- Preferred: Experience leading cloud cost optimization (FinOps) initiatives
Benefits
- Bonus Eligible
- Paid sick leave in compliance with the Colorado Healthy Families and Workplaces Act
- Comprehensive benefit offerings available at www.broadridgebenefits.com
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, Platform Engineering, DevOps, Systems Engineering, Python, Java, Linux, Terraform, Infrastructure as Code, EKS
Soft Skills
leadership, communication, mentoring, collaboration, problem-solving, continuous improvement, operational excellence, influence, root cause analysis, post-incident reviews