Site Reliability Engineer

BT Group

Full-time

Posted on: 1/29/2026

Location Type: Office

Location: Bengaluru • India

Job Level

Mid-Level, Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Java, Microservices, Python, SDLC

About the role

Implement and operate CI/CD and SDLC automation using cloud services, infrastructure as code (IaC), GitOps patterns and containers, following established engineering and security practices.
Contribute to test planning and execution with delivery and QA teams to meet quality and time goals; help define practical coverage and schedules.
Automate to reduce toil and MTTR (mean time to resolve): build scripts, runbooks and guardrailed tasks that remove repetitive work and improve recovery speed.
Participate in Tier 2/3 incident response: diagnose, mitigate and recover; capture learnings and drive preventive follow-ups.
Implement and tune observability (metrics, logs, traces, dashboards, alerts) to improve signal quality and reduce noise.
Apply SRE fundamentals with teams: define and maintain SLIs and SLOs with error budgets; propose data-driven reliability improvements.
Harden release reliability: keep pipelines stable, safe and reliable; identify configuration drift and remediate quickly.
Support on-call readiness: runbook stewardship, change/rollback safety, participation in DR/failover exercises and game days.
Identify reliability risks across services and environments; raise issues early and support mitigations and control adoption.
Collaborate with developers, platform, operations and partners; document clearly and support peer learning.

Requirements

Strong expertise in end-to-end observability and monitoring platforms (e.g., Dynatrace) to understand system health, performance trends, and reliability of business-critical applications.
Proficiency in one or more programming languages (e.g., Java, Python) with the ability to write production-quality automation and tooling.
Hands-on experience with cloud platforms (AWS, Azure, or GCP) and operating distributed systems in cloud and hybrid environments.
Firm grasp of software architecture, design patterns, and microservices-based systems.
Practical experience with CI/CD pipelines, DevOps practices, and continuous testing to support fast, reliable delivery.
Proven ability to apply Site Reliability Engineering principles, including automation, toil reduction, incident learning, and reliability-driven system improvements.
Experience analysing complex, distributed systems to identify performance, resilience, and stability issues.
Ability to support 24x7 operational environments, working effectively with stakeholders, backend teams and managed service partners during priority incidents.
Strong analytical, reporting, and presentation skills, enabling clear communication of operational insights, risks, and improvement opportunities.
Demonstrated mindset for business process improvement, using data and automation to drive efficiency and reliability gains.
Understanding of AIOps fundamentals, including cross-domain telemetry ingestion, event correlation, topology and context modelling, and remediation augmentation.
Experience with AI-assisted and agentic observability, using intelligent techniques to detect anomalies, correlate signals, and accelerate incident resolution.
Capability in AI-driven alerting and noise reduction, designing contextual, business-impact-aware alerts and leveraging machine learning to prioritise alerts and reduce alert fatigue.
Familiarity with AIOps capabilities in modern observability platforms: event correlation, dynamic topology/context modelling, impact-aware alerting and alert noise reduction.
Exposure to controlled fault injection with tools like Gremlin, Litmus or Chaos Mesh, and to translating findings into tangible reliability improvements.
Familiarity with model drift/freshness concepts and high-level SLIs/SLOs for ML services, plus basic approaches to monitoring model health signals.

Benefits

Flexible working hours

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

CI/CD; SDLC automation; infrastructure as code (IaC); GitOps; containers; programming languages (Java, Python); cloud platforms (AWS, Azure, GCP); observability (metrics, logs, traces, dashboards, alerts); Site Reliability Engineering (SRE); AIOps

Soft Skills

analytical skills; reporting skills; presentation skills; collaboration; communication; problem-solving; stakeholder management; incident response; process improvement; learning mindset