
IT Resiliency Engineer
AppSierra
full-time
Location Type: Hybrid
Location: Pune • 🇮🇳 India
Job Level
Mid-Level, Senior
Tech Stack
Ansible, Cloud
About the role
- Oversee the design and implementation of resilient engineering across the technology domains.
- Design and review resilient solutions in both cloud-based and on-premises environments.
- Lead chaos engineering efforts to proactively identify and mitigate potential system weaknesses.
- Collaborate with teams to evolve existing standards for system monitoring and alerting, ensuring rapid detection and response.
- Represent the IT Resiliency Office during the Architectural Review Board.
- Collaborate with various teams across the organization to align and prioritize resiliency and recovery efforts.
- Apply expertise in Infrastructure as Code (IaC) and tools such as Ansible.
- Participate in the post-mortem process following major incidents to identify opportunities for enhancing resiliency.
- Evangelize standards and practices across the Technology organization to strengthen our resiliency posture.
- Develop standardized, regular reporting on resilience activities, risks, and improvements for the Leadership team.
Requirements
- Bachelor's degree or equivalent experience.
- 5-10 years of experience in platform engineering, with a focus on IaC, DevOps practices, and orchestration tools.
- Preferred but not required: experience as a team lead or hands-on technical manager who can engage with projects and deliver them to completion.
- A track record of successfully architecting and deploying enterprise-level solutions that prioritize system uptime and data integrity across various operational scenarios.
- Demonstrated ability to design and implement systems that ensure high availability, support massive transaction volumes, and facilitate seamless disaster recovery processes.
- Experience in infrastructure and service architecture and engineering, including functional and technical requirements gathering and solution development.
- Strong dedication to customer needs, with excellent communication and the ability to build lasting relationships, alongside the capability to articulate complex resilience strategies in a clear and impactful manner.
- Deep insight into the complexities of multi-AZ and multi-Region cloud platforms, with a keen understanding of how these impact system resilience and disaster recovery planning.
- Proven experience in the ongoing management of mission-critical systems that require constant uptime, including out-of-hours support and rapid response to incidents.
- Knowledgeable in evaluating and deciding on trade-offs between consistency, availability, and partition tolerance, especially in the context of system failures and recovery strategies.
- Well-versed in various cloud service models such as SaaS, PaaS, and IaaS, with hands-on experience in designing resilient services on leading public cloud platforms.
- Proficient in Chaos Engineering principles and practices, with experience in designing and conducting experiments to validate the system's capability to withstand turbulent conditions.
- Skilled in implementing observability solutions that provide real-time insights into the performance and health of systems, aiding in proactive issue detection and resolution.
- Practical experience operating in an Agile development environment.
Applicant Tracking System Keywords
Hard skills
- Infrastructure as Code (IaC)
- DevOps practices
- Orchestration tools
- Chaos Engineering
- System monitoring
- Disaster recovery
- High availability
- Cloud service models (SaaS, PaaS, IaaS)
- Observability solutions
- Platform engineering
Soft skills
- Communication
- Customer focus
- Relationship building
- Leadership
- Collaboration
- Problem-solving
- Strategic thinking
- Adaptability
- Project management
- Technical management
Certifications
- Bachelor's degree
- Technical Manager certification