
Site Reliability Engineer
BT Group
full-time
Location Type: Office
Location: Bengaluru • India
About the role
- Implement and operate CI/CD and SDLC automation using cloud services, infrastructure as code (IaC), GitOps patterns and containers, following established engineering and security practices.
- Contribute to test planning and execution with delivery and QA to meet quality and time goals; help define practical coverage and schedules.
- Automate to reduce toil and MTTR (mean time to resolve): build scripts, runbooks and guardrailed tasks that remove repetitive work and improve recovery speed.
- Participate in Tier 2/3 incident response: diagnose, mitigate and recover; capture learnings and drive preventive follow-ups.
- Implement and tune observability (metrics, logs, traces, dashboards, alerts) to improve signal quality and reduce noise.
- Apply SRE fundamentals with teams: define and maintain SLIs and SLOs with error budgets; propose data-driven reliability improvements.
- Harden release reliability: keep pipelines stable, safe and reliable; identify configuration drift and remediate quickly.
- Support on-call readiness: runbook stewardship, change/rollback safety, participation in DR/failover exercises and game days.
- Identify reliability risks across services and environments; raise issues early and support mitigations and control adoption.
- Collaborate with developers, platform, operations and partners; document clearly and support peer learning.
Requirements
- Strong expertise in end-to-end observability and monitoring platforms (e.g., Dynatrace) to understand system health, performance trends, and reliability of business-critical applications.
- Proficiency in one or more programming languages (e.g., Java, Python) with the ability to write production-quality automation and tooling.
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and operating distributed systems in cloud and hybrid environments.
- Firm grasp of software architecture, design patterns, and microservices-based systems.
- Practical experience with CI/CD pipelines, DevOps practices, and continuous testing to support fast, reliable delivery.
- Proven ability to apply Site Reliability Engineering principles, including automation, toil reduction, incident learning, and reliability-driven system improvements.
- Experience analysing complex, distributed systems to identify performance, resilience, and stability issues.
- Ability to support 24x7 operational environments, working effectively with stakeholders, backend teams and managed service partners during priority incidents.
- Strong analytical, reporting, and presentation skills, enabling clear communication of operational insights, risks, and improvement opportunities.
- Demonstrated mindset for business process improvement, using data and automation to drive efficiency and reliability gains.
- Understanding of AIOps fundamentals, including cross-domain telemetry ingestion, event correlation, topology and context modelling, and remediation augmentation.
- Experience with AI-assisted and agentic observability, using intelligent techniques to detect anomalies, correlate signals, and accelerate incident resolution.
- Capability in AI-driven alerting and noise reduction, designing contextual, business-impact-aware alerts and leveraging machine learning to prioritise alerts and reduce alert fatigue.
- Familiarity with AIOps capabilities in modern observability platforms: event correlation, dynamic topology/context modelling, impact-aware alerting and alert noise reduction.
- Exposure to controlled fault injection with tools like Gremlin/Litmus/Chaos Mesh; translating findings into tangible reliability improvements.
- Familiarity with model drift/freshness concepts and high-level SLIs/SLOs for ML services; basic approaches to monitoring model health signals.
Benefits
- Flexible working hours
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
CI/CD, SDLC automation, infrastructure as code (IaC), GitOps, containers, programming languages (Java, Python), cloud platforms (AWS, Azure, GCP), observability (metrics, logs, traces, dashboards, alerts), Site Reliability Engineering (SRE), AIOps
Soft Skills
analytical skills, reporting skills, presentation skills, collaboration, communication, problem-solving, stakeholder management, incident response, process improvement, learning mindset