Cloud Reliability Engineer – Recovery

AlphaSense

Cloud Reliability & Recovery Engineer focused on designing and improving AWS business continuity (BCP) and disaster recovery (DR) capabilities at AlphaSense, a market intelligence company. Collaborates across teams to keep systems resilient and to recover them after disruptions.

Posted 5/5/2026 • Full-time • Remote • 🇮🇳 India • Mid-Level, Senior

Tech Stack

Tools & technologies

AWS, Cloud, DNS, DynamoDB, EC2, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact

Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
Implement chaos engineering practices using AWS Fault Injection Simulator (FIS) to validate resiliency
Architect cross-region replication strategies for S3, DynamoDB Global Tables, RDS, and Aurora Global Database
Review containerized workloads on Kubernetes, ensuring resilience through self-healing, auto-scaling, and multi-cluster or multi-region deployments
Administer AWS Backup across all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
Design immutable backup vaults and cross-account/cross-region backup replication pipelines
Develop and automate data recovery testing procedures, ensuring integrity and meeting defined SLAs
Implement point-in-time recovery (PITR) for databases and storage; validate via regular restore drills
Maintain Business Continuity Plans (BCP) and Disaster Recovery (DR) strategies, including tracking RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
Author and maintain Terraform/CloudFormation templates for all BCP/DR infrastructure components
Automate DR testing pipelines through CI/CD (CodePipeline, CodeBuild, GitHub Actions)
Write Python/Bash/PowerShell scripts to orchestrate failover, failback, and health-check workflows
Manage infrastructure state in AWS Control Tower and implement Landing Zone DR patterns
Build CloudWatch dashboards, alarms, and composite alarms for availability and DR-readiness indicators
Integrate AWS Health and Personal Health Dashboard events into PagerDuty/OpsGenie alerting workflows
Participate in on-call rotations and lead DR incident response; conduct post-incident reviews (PIRs)
Develop and maintain runbooks for AWS service degradations, regional outages, and data corruption events
Conduct regular BCP/DR tabletop exercises and full failover simulations to validate recovery procedures and improve organizational readiness; document results and action items
Ensure DR controls meet SOC 2, ISO 22301, NIST 800-53, and HIPAA/PCI requirements as applicable
Maintain current and accurate DR documentation: BIAs, BCPs, DRP runbooks, and recovery evidence
Collaborate with audit and compliance teams to provide DR evidence and remediation tracking

Requirements

What you’ll need

5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles
3+ years of hands-on AWS experience in production environments at scale
Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets
Expert-level proficiency with core AWS resilience services
Strong scripting skills: Python, Bash, or PowerShell for automation and orchestration
Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation
Solid understanding of networking fundamentals: VPC, TGW, Direct Connect, VPN, DNS failover
Excellent written and verbal communication; able to produce executive-level DR reports

Benefits

Comp & perks

Competitive salary
Remote work options

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AWS, Terraform, CloudFormation, Python, Bash, PowerShell, Route 53, Global Accelerator, CloudFront, Kubernetes

Soft Skills

communication, leadership, organizational, collaboration, incident response, documentation