
Resilience Engineer
Vodafone
full-time
Location Type: Hybrid
Location: Lisboa • Portugal
About the role
- Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response;
- Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets;
- Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness;
- Collaborating with product, infrastructure, architecture and development teams to re-design services with built-in redundancy, failover, and graceful degradation;
- Driving automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation;
- Contributing to the design and maintenance of our Business Continuity and Disaster Recovery Plan (BCDR), ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions;
- Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide evolution of stability practices;
- Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs;
- Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform;
- Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance;
- Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders.
Requirements
- Educated to BSc degree level in Software Engineering, Computer Science, or a related discipline;
- Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation;
- Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies;
- Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness;
- Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation;
- Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery;
- Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy;
- Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs;
- Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions;
- Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making;
- Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action;
- Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams;
- Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments;
- Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner;
- Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle;
- Fluency in English.
Benefits
- Hybrid Work Model - Flexible hybrid work model with 8-10 in-office days per month, managed by team leaders;
- Vodafone Products and Services - Employees get a mobile phone, free communication plan, data card, and various discounts on services and products;
- Recognition - Recognition programs for innovative, creative, high-potential employees and exemplary behaviors;
- Health and Well-being - Well-being Program offers nutrition and psychological consultations, webinars, workshops, and discounts on various services and products;
- Learning - Access to Communities of Practice and a customizable digital training platform with high-quality content (namely Harvard Business Publishing, Skillsoft and Speexx);
- Local and International Mobility - Internal recruitment with local and international rotation opportunities across departments and roles.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Bash, Go, PowerShell, chaos engineering, Business Continuity, ISO 22301, Site Reliability Engineering, telemetry, monitoring frameworks
Soft skills
collaboration, communication, leadership, problem-solving, data-driven decision making, customer experience focus, relationship building, post-incident review, workshop facilitation, outcome-driven