Salary
💰 $118,000 - $150,000 per year
Tech Stack
AWS, Django, Docker, Kubernetes, Linux, Postgres, Python, RabbitMQ, RDBMS, Terraform
About the role
- Coach and support product teams in reliability best practices, implementation patterns and effective use of our existing platforms
- Support product teams in improving the performance and availability of their systems
- Be hands-on in code and infrastructure to help product teams with reliability improvements
- Give the wider Platform group detailed feedback on improvements to core infrastructure, based on observations and first-hand experience in the code base
- Support product teams in building proofs of concept as needed, evolving the application deployment architecture to keep pace with business growth and to improve scalability and system resilience
- Collaborate with product teams to support the release of new features and services, ensuring adherence to reliability and performance standards
- Guide product teams in designing systems for resilience and graceful failure under heavy load
- Assist application teams with post-incident tasks and follow-ups, and contribute to the creation and review of post-mortem documentation
- Analyse incident metrics to identify trends and potential improvements, communicating these insights to the product teams
- Help solve interesting and difficult problems to drive disruption in the global energy market
Requirements
- Great communication skills, working effectively with developers, product managers and other business stakeholders to understand, design and deliver impactful projects and reliability improvements
- Proficiency with AWS; we use a wide range of AWS services, not just the standard few
- Strong Python skills, particularly with Django, the Django ORM and Celery
- Expertise in PostgreSQL or a similar RDBMS, particularly in Amazon RDS at scale
- Experience with Docker and Kubernetes (we use Amazon EKS in production)
- Experience with Datadog or a similar logging/monitoring tool
- Experience with message queues and event-driven asynchronous processing (we use RabbitMQ)
- Experience with Terraform or a similar infrastructure-as-code tool
- Experience working with a Linux distribution
- Previous experience working in small, highly autonomous teams
- (Helpful) Previous experience as a Site Reliability Engineer
- (Helpful) Experience working on SaaS platforms and engaging product teams
- (Helpful) Experience managing and supporting large-scale internet-facing services
- (Helpful) Experience responding to incidents and outages, writing incident reports and organising retrospectives
- (Helpful) Experience working with very large relational databases
- (Helpful) Experience using service level objectives to improve application performance
- (Helpful) A proactive, innovative mindset