Resilience Engineer

Vodafone

Full-time

Posted on: 1/9/2026

Location Type: Hybrid

Location: Lisboa • Portugal

Job Level

Mid-Level, Senior

Tech Stack

Cloud, Go, Grafana, IoT, Linux, Prometheus, Python, Splunk

About the role

Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response;
Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets;
Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness;
Collaborating with product, infrastructure, architecture, and development teams to redesign services with built-in redundancy, failover, and graceful degradation;
Driving automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation;
Contributing to the design and maintenance of our Business Continuity and Disaster Recovery Plan (BCDR), ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions;
Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide evolution of stability practices;
Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs;
Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform;
Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance;
Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders.

Requirements

Educated to BSc degree level in Software Engineering, Computer Science, or a related discipline;
Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation;
Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies;
Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness;
Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation;
Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery;
Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy;
Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs;
Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions;
Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making;
Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action;
Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams;
Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments;
Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner;
Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle;
Fluency in English.

Benefits

Hybrid Work Model - Flexible hybrid work model with 8-10 in-office days per month, managed by team leaders;
Vodafone Products and Services - Employees get a mobile phone, free communication plan, data card, and various discounts on services and products;
Recognition - Recognition programs for innovative, creative, high-potential employees and exemplary behaviors;
Health and Well-being - Well-being Program offers nutrition and psychological consultations, webinars, workshops, and discounts on various services and products;
Learning - Access to Communities of Practice and a customizable digital training platform with high-quality content (namely Harvard Business Publishing, Skillsoft and Speexx);
Local and International Mobility - Internal recruitment with local and international rotation opportunities across departments and roles.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

Python, Bash, Go, PowerShell, chaos engineering, Business Continuity, ISO 22301, Site Reliability Engineering, telemetry, monitoring frameworks

Soft skills

collaboration, communication, leadership, problem-solving, data-driven decision making, customer experience focus, relationship building, post-incident review, workshop facilitation, outcome-driven