Tech Stack
Apache, AWS, Azure, Distributed Systems, Google Cloud Platform, Kubernetes, Linux, Pulsar, Python
About the role
- Support production platforms and participate in on-call rotation
- Work closely with service teams to continuously improve reliability, scalability, and performance of systems
- Develop automation solutions and drive improvements in automation, observability, and reliability practices
- Troubleshoot and resolve production incidents and contribute to sustainable long-term solutions
- Mentor and support other SRE team members and provide expert guidance to development teams
- Lead complex changes, influence operating strategy, and identify and deliver impactful SRE-led projects
- Partner with infrastructure teams to evolve and strengthen the platform
Requirements
- 6+ years of experience in SRE or equivalent operationally focussed engineering roles
- Linux administration experience, expected as a day-one skill
- Experience of operating live, production-grade Kubernetes environments
- Expertise in problem diagnosis across complex, distributed systems
- Proficiency in a scripting language suited to automation (e.g., Python, Bash)
- Experience with Git version control and modern CI/CD and DevOps practices
- Willingness to participate in on-call rotation
- Hands-on experience with one or more public clouds (AWS, GCP, Azure) (desirable)
- Experience with Event Streaming, Exception Management, and Integration technologies such as Apache Pulsar (desirable)
- Experience with stream-processing and batch-processing frameworks such as Apache Flink (desirable)
- Experience with configuration management and infrastructure as code (desirable)
- Knowledge of observability and monitoring best practices (desirable)
- Prior experience mentoring or coaching other engineers (desirable)