Vodafone

Resilience Engineer

Full-time

Location Type: Hybrid

Location: Lisboa, Portugal

About the role

  • Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response;
  • Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets;
  • Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness;
  • Collaborating with product, infrastructure, architecture and development teams to re-design services with built-in redundancy, failover, and graceful degradation;
  • Driving automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation;
  • Contributing to the design and maintenance of our Business Continuity and Disaster Recovery Plan (BCDR), ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions;
  • Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide evolution of stability practices;
  • Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs;
  • Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform;
  • Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance;
  • Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders.

Requirements

  • Educated to BSc degree level in Software Engineering, Computer Science, or a related discipline;
  • Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation;
  • Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies;
  • Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness;
  • Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation;
  • Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery;
  • Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy;
  • Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs;
  • Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions;
  • Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making;
  • Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action;
  • Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams;
  • Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments;
  • Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner;
  • Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle;
  • Fluency in English.

Benefits

  • Hybrid Work Model - Flexible hybrid work model with 8-10 in-office days per month, managed by team leaders;
  • Vodafone Products and Services - Employees get a mobile phone, free communication plan, data card, and various discounts on services and products;
  • Recognition - Recognition programs for innovative, creative, high-potential employees and exemplary behaviors;
  • Health and Well-being - Well-being Program offers nutrition and psychological consultations, webinars, workshops, and discounts on various services and products;
  • Learning - Access to Communities of Practice and a customizable digital training platform with high-quality content (namely Harvard Business Publishing, Skillsoft and Speexx);
  • Local and International Mobility - Internal recruitment with local and international rotation opportunities across departments and roles.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Python, Bash, Go, PowerShell, chaos engineering, Business Continuity, ISO 22301, Site Reliability Engineering, telemetry, monitoring frameworks
Soft skills
collaboration, communication, leadership, problem-solving, data-driven decision making, customer experience focus, relationship building, post-incident review, workshop facilitation, outcome-driven