Tech Stack
Ansible, AWS, Azure, Cloud, Django, DNS, Docker, Firewalls, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, TCP/IP, Terraform
About the role
- Reliability & Support: Act as a primary point of contact for L2 support issues, troubleshooting complex problems across our stack and driving them to resolution. Implement permanent fixes and preventative measures to reduce recurrence.
- Automation & Tooling: Design, develop, and implement automation solutions for common operational tasks, system provisioning, maintenance, and incident response. Reduce operational "toil" through smart tooling.
- Observability & Monitoring: Lead the strategy and implementation of comprehensive monitoring, logging, and alerting systems. Enhance our observability stack to provide deep insights into system health and performance.
- System Architecture & Design: Collaborate with development and product teams to design and build scalable, reliable, and secure infrastructure and applications. Provide an SRE perspective on new features and architectural decisions.
- Incident Management: Participate in on-call rotations, respond to incidents, perform root cause analyses, and implement post-incident actions to prevent future occurrences.
- Mentorship & Leadership: Mentor junior SREs and other engineering team members, sharing best practices in reliability, operations, and software development. Potentially lead small projects or initiatives.
- Performance Optimization: Identify and address performance bottlenecks across infrastructure and applications.
- Documentation: Create and maintain thorough documentation for systems, processes, and playbooks.
Requirements
- Expert-level proficiency in Python for scripting, automation, and tooling.
- Experience with the Django framework is a strong plus.
- 7+ years of experience in a Site Reliability Engineering, DevOps, or similar role with a strong focus on system reliability and automation.
- Deep understanding and extensive experience with Linux operating systems (Ubuntu preferred), including system administration, networking, and troubleshooting.
- Extensive experience with containerization technologies, especially Docker.
- Strong practical experience with container orchestration platforms, specifically Kubernetes, including deployment, management, and troubleshooting of clusters and applications.
- Demonstrated experience in mentorship, team leadership, or technical management. You should be comfortable guiding, coaching, and developing less experienced engineers.
- Methodical approach to problem-solving: Ability to systematically diagnose complex issues, analyze data, and propose effective solutions.
- Self-motivated and proactive: Takes initiative, identifies areas for improvement, and drives projects to completion with minimal supervision.
- Excellent communication skills: Ability to articulate complex technical concepts clearly to both technical and non-technical audiences, write clearly, and collaborate effectively across teams.
- Experience with cloud platforms (e.g., AWS, GCP, Azure) is a significant advantage, with AWS preferred.
- Experience with CI/CD pipelines and related tools.
- Familiarity with infrastructure as code (e.g., Terraform, Ansible).
- Understanding of networking concepts (TCP/IP, DNS, Load Balancing, Firewalls).
- Experience with various monitoring and alerting tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, New Relic).