Senior Cloud Resilience Architect

Blink Health

Posted 5/6/2026 • Full-time • Remote • New York • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies

Ansible, AWS, Azure, Cloud, DNS, Google Cloud Platform, Kubernetes, Terraform

About the role

Key responsibilities & impact

Evaluate and mature the organization’s disaster recovery posture, including recovery objectives (RTO/RPO), dependency mapping, and failure domain analysis across applications, data, and infrastructure.
Define, document, and establish disaster recovery standards and best practices across cloud infrastructure, platforms, and application architectures.
Partner with SRE, platform, security, and product engineering teams to design and implement resilient, fault-tolerant systems, progressing from backup-based recovery to multi-region and active-active architectures.
Lead the disaster recovery roadmap, balancing technical feasibility, cost, risk, and business priorities.
Design and recommend reference architectures for disaster recovery patterns, including pilot-light, warm standby, hot standby, and active-active.
Drive adoption of active-active disaster recovery for critical systems, including traffic management, data replication, consistency models, and automated failover.
Define and operationalize testing strategies for DR, including game days, chaos testing, and regular recovery exercises.
Establish clear documentation, runbooks, and escalation paths to ensure recoverability is well understood and not dependent on individuals.
Evaluate and recommend platform upgrades, cloud services, and tooling that improve resilience, recovery speed, and reliability.
Serve as a technical authority and advisor on disaster recovery and resilience for leadership and engineering teams.
Provide architectural guidance, design reviews, and mentorship to engineers implementing DR-related changes.
Partner with security and compliance teams to ensure DR strategies meet regulatory, audit, and data protection requirements.

Requirements

What you’ll need

Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
8+ years of experience in cloud infrastructure, platform engineering, SRE, or reliability-focused architecture roles.
Deep understanding of disaster recovery concepts including RTO/RPO, blast radius reduction, failure domains, and dependency isolation.
Proven experience designing and implementing multi-region and multi-availability zone architectures.
Hands-on experience moving systems toward active-active or highly available architectures.
Strong grasp of data replication strategies, consistency tradeoffs, and recovery patterns for databases and stateful systems.
Extensive experience with major cloud providers (AWS preferred, GCP/Azure acceptable).
Strong understanding of managed cloud services and their DR characteristics and limitations.
Experience with Kubernetes-based platforms, including regional failover, workload portability, and cluster recovery strategies.
Familiarity with global traffic management, DNS, load balancing, and service mesh patterns.
Experience designing and maintaining Infrastructure as Code using tools such as Terraform, Pulumi, CloudFormation, or Ansible.
Strong focus on automation for recovery workflows, failover testing, and environment provisioning.
Ability to eliminate manual recovery steps and reduce time-to-recovery through software.
Experience defining and running DR tests, game days, and failure simulations.
Comfortable working across organizational boundaries to influence priorities and standards.
Strong documentation and communication skills, with the ability to translate complex technical risk into business impact.

Benefits

Comp & perks

Health insurance
Remote work flexibility
Professional development
Paid time off

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

disaster recovery, RTO, RPO, dependency mapping, failure domain analysis, data replication, consistency models, Infrastructure as Code, Kubernetes, cloud architecture

Soft Skills

leadership, communication, documentation, mentorship, influence, collaboration, organizational skills, technical authority, problem-solving, strategic thinking