Support our production platforms, including participating in our on-call rotation
Work closely with service teams to continuously improve the reliability, scalability, and performance of our systems
Develop automation solutions and drive improvements in automation, observability, and reliability practices
Troubleshoot and resolve production incidents, contributing to sustainable long-term solutions
Mentor and support other SRE team members and provide expert guidance to development teams
Lead complex changes, identify and deliver impactful SRE-led projects, and influence operating strategy
Partner with infrastructure teams to evolve and strengthen our platform
Requirements
6+ years of experience in SRE or equivalent operationally focussed engineering roles
Linux administration experience, required as a day-one skill
Experience of operating live, production-grade Kubernetes environments
Expertise in problem diagnosis across complex, distributed systems
Proficiency in a scripting language suited to automation (e.g., Python, Bash)
Experience with Git version control and modern CI/CD and DevOps practices
Ability to participate in on-call rotation and troubleshoot production incidents
(Desirable) Hands-on experience with one or more public clouds (AWS, GCP, Azure)
(Desirable) Experience with event streaming, exception management, and integration technologies such as Apache Pulsar
(Desirable) Experience with stream-processing and batch-processing frameworks such as Apache Flink
(Desirable) Experience with configuration management and infrastructure as code
(Desirable) Knowledge of observability and monitoring best practices
(Desirable) Prior experience mentoring or coaching other engineers
Benefits
Commitment to Diversity, Equity, Inclusion and Belonging (DEIB)
Reasonable accommodation for individuals with disabilities to participate in the application, perform essential functions, and receive equitable benefits
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Linux administration, Kubernetes, problem diagnosis, Python, Bash, Git, CI/CD, DevOps, Event Streaming, Apache Flink
Soft skills
mentoring, supporting team members, guidance, leadership, troubleshooting, collaboration, influencing strategy, communication, problem-solving, improvement driving