
Site Reliability Engineer
Trainline
full-time
Location Type: Hybrid
Location: London • United Kingdom
Salary
💰 £55,000 - £63,000 per year
About the role
- Developing an understanding of system architecture, dependencies, and failure modes across the Trainline platform
- Participating in production incident response, supporting investigation, mitigation, communication, and coordinated service restoration
- Contributing to post-incident reviews and follow-up actions to improve reliability, scalability, and resilience
- Taking part in the SRE on-call rotation
- Designing, building, and maintaining observability using metrics, logs, events, and traces to support effective detection and diagnosis
- Improving monitoring and alerting by aligning signals to business and customer impact, reducing noise and improving mean time to detection (MTTD)
- Ensuring relevant operational data is surfaced quickly and clearly during live incidents
- Making informed tooling and technology choices using SRE principles, balancing team and business needs
- Supporting AWS-hosted infrastructure and shared platform services using infrastructure-as-code and CI/CD tooling
- Collaborating with product engineering teams to ensure services are operationally ready and deployed safely
- Advising on reliability and resilience practices
- Writing and maintaining reliable, well-structured code and scripts to support reliability and observability goals
- Prioritising work effectively and collaborating using agile processes to deliver against team and business goals
Requirements
- Experience of SRE concepts such as SLI, SLO and error budgets.
- Hands-on experience with observability tooling such as New Relic, Elastic (ELK Stack), Influx, Grafana or similar.
- Experience working with cloud providers (preferably AWS).
- Experience troubleshooting Linux operating systems.
- Experience of scripting in at least one language (preferably Python).
- Understanding of load balancing and reverse proxy concepts, including upstream configuration, upstream health checks, and worker and data flow.
- Understanding of application architecture concepts (threading, queuing, readiness checks, health checks, circuit breakers, timeouts, exponential backoff, throttling).
- Experience building, maintaining and evolving time series data, including retention, cardinality, deviation, moving averages and other functions.
- Experience with build, deployment & configuration management tooling such as GitHub Actions and Terraform.
Benefits
- private healthcare & dental insurance
- generous work from abroad policy
- 2-for-1 share purchase plans
- EV Scheme to reduce carbon emissions
- extra festive time off
- excellent family-friendly benefits
- clear career paths
- transparent pay bands
- personal learning budgets
- regular learning days
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
SRE concepts, SLI, SLO, error budgets, observability tooling, New Relic, Elastic (ELK Stack), Influx, Grafana, scripting (Python)
Soft Skills
collaboration, communication, prioritization, agile processes