Cloud Reliability Engineer – Recovery
AlphaSense
Cloud Reliability & Recovery Engineer focusing on designing and improving AWS BCP and DR capabilities at AlphaSense, a market intelligence company. Collaborates across teams for system resilience and recovery from disruptions.
Tech Stack
Tools & technologies
AWS, Cloud, DNS, DynamoDB, EC2, Kubernetes, Python, Terraform
About the role
Key responsibilities & impact
- Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
- Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
- Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
- Implement chaos engineering practices using AWS Fault Injection Simulator (FIS) to validate resiliency
- Architect cross-region replication strategies for S3, DynamoDB Global Tables, RDS, and Aurora Global Database
- Review containerized workloads using Kubernetes, ensuring resilience through self-healing, auto-scaling, and multi-cluster or multi-region deployments.
- Administer AWS Backup across all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
- Design immutable backup vaults and cross-account/cross-region backup replication pipelines
- Develop and automate data recovery testing procedures, ensuring integrity and meeting defined SLAs
- Implement point-in-time recovery (PITR) for databases and storage; validate via regular restore drills
- Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies, including tracking RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
- Author and maintain Terraform/CloudFormation templates for all BCP/DR infrastructure components
- Automate DR testing pipelines through CI/CD (CodePipeline, CodeBuild, GitHub Actions)
- Write Python/Bash/PowerShell scripts to orchestrate failover, failback, and health-check workflows
- Manage infrastructure state in AWS Control Tower and implement Landing Zone DR patterns
- Build CloudWatch dashboards, alarms, and composite alarms for availability and DR-readiness indicators
- Integrate AWS Health and Personal Health Dashboard events into PagerDuty/OpsGenie alerting workflows
- Participate in on-call rotations and lead DR incident response; conduct post-incident reviews (PIRs)
- Develop and maintain runbooks for AWS service degradations, regional outages, and data corruption events
- Conduct regular BCP/DR tabletop exercises and full failover simulations to validate recovery procedures and improve organizational readiness; document results and action items
- Ensure DR controls meet SOC 2, ISO 22301, NIST 800-53, and HIPAA/PCI requirements as applicable
- Maintain current and accurate DR documentation: BIAs, BCPs, DRP runbooks, and recovery evidence
- Collaborate with audit and compliance teams to provide DR evidence and remediation tracking
Requirements
What you’ll need
- 5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles
- 3+ years of hands-on AWS experience in production environments at scale
- Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets
- Expert-level proficiency with core AWS resilience services
- Strong scripting skills: Python, Bash, or PowerShell for automation and orchestration
- Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation
- Solid understanding of networking fundamentals: VPC, TGW, Direct Connect, VPN, DNS failover
- Excellent written and verbal communication; able to produce executive-level DR reports.
Benefits
Comp & perks
- Competitive salary
- Remote work options
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, Terraform, CloudFormation, Python, Bash, PowerShell, Route 53, Global Accelerator, CloudFront, Kubernetes
Soft Skills
communication, leadership, organizational, collaboration, incident response, documentation