BT Group

Site Reliability Engineer

full-time

Location Type: Office

Location: Bengaluru, India

About the role

  • Implement and operate CI/CD and SDLC automation using cloud services, infrastructure as code (IaC), GitOps patterns and containers, following established engineering and security practices.
  • Contribute to test planning and execution with delivery and QA to meet quality and time goals; help define practical coverage and schedules.
  • Automate to reduce toil and MTTR (mean time to resolve): build scripts, runbooks and guardrailed tasks that remove repetitive work and speed up recovery.
  • Participate in Tier 2/3 incident response: diagnose, mitigate and recover; capture learnings and drive preventive follow-ups.
  • Implement and tune observability (metrics, logs, traces, dashboards, alerts) to improve signal quality and reduce noise.
  • Apply SRE fundamentals with teams: define and maintain SLIs and SLOs with error budgets; propose data-driven reliability improvements.
  • Harden release reliability: keep pipelines stable and safe; identify configuration drift and remediate it quickly.
  • Support on-call readiness: runbook stewardship, change/rollback safety, and participation in DR/failover exercises and game days.
  • Identify reliability risks across services and environments; raise issues early and support mitigations and control adoption.
  • Collaborate with developers, platform, operations and partners; document clearly and support peer learning.

Requirements

  • Strong expertise in end-to-end observability and monitoring platforms (e.g., Dynatrace) to understand the system health, performance trends, and reliability of business-critical applications.
  • Proficiency in one or more programming languages (e.g., Java, Python) with the ability to write production-quality automation and tooling.
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP) and operating distributed systems in cloud and hybrid environments.
  • Firm grasp of software architecture, design patterns, and microservices-based systems.
  • Practical experience with CI/CD pipelines, DevOps practices, and continuous testing to support fast, reliable delivery.
  • Proven ability to apply Site Reliability Engineering principles, including automation, toil reduction, incident learning, and reliability-driven system improvements.
  • Experience analysing complex, distributed systems to identify performance, resilience, and stability issues.
  • Ability to support 24x7 operational environments, working effectively with stakeholders, backend teams, and managed service partners during priority incidents.
  • Strong analytical, reporting, and presentation skills, enabling clear communication of operational insights, risks, and improvement opportunities.
  • Demonstrated mindset for business process improvement, using data and automation to drive efficiency and reliability gains.
  • Understanding of AIOps fundamentals, including cross-domain telemetry ingestion, event correlation, topology and context modelling, and remediation augmentation.
  • Experience with AI-assisted and agentic observability, using intelligent techniques to detect anomalies, correlate signals, and accelerate incident resolution.
  • Capability in AI-driven alerting and noise reduction, designing contextual, business-impact-aware alerts and leveraging machine learning to prioritise alerts and reduce alert fatigue.
  • Familiarity with AIOps capabilities in modern observability platforms: event correlation, dynamic topology/context modelling, impact-aware alerting, and alert noise reduction.
  • Exposure to controlled fault injection with tools like Gremlin/Litmus/Chaos Mesh; translating findings into tangible reliability improvements.
  • Awareness of model drift/freshness concepts and high-level SLIs/SLOs for ML services, plus basic approaches to monitoring model health signals.

Benefits

  • Flexible working hours