Tech Stack
Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Puppet, Python, Ruby, Terraform
About the role
- Develop and maintain monitoring and alerting systems to proactively identify and address issues.
- Troubleshoot and escalate production incidents to minimize downtime and improve system reliability.
- Continuously improve our infrastructure and processes to optimize scalability and efficiency.
- Participate in and take ownership of on-call rotations as needed to ensure 24/7 support for our application.
- Perform routine maintenance and upgrades as needed to keep our systems up to date.
- Contribute to ongoing efforts to improve our security posture and compliance with industry standards.
- Communicate complex technical concepts clearly and concisely to both technical and non-technical stakeholders so that they can make well-informed decisions.
- Mentor and coach junior engineers, fostering their professional growth and enabling them to deliver high-quality work.
- Stay up to date with the latest advancements and trends in site reliability engineering, and share knowledge and insights with the team.
- Identify opportunities for organizational enhancements and propose alternatives to optimize team structures and execution.
- Collaborate with development teams to design and implement automated deployment and testing pipelines.
- Collaborate with development teams to design and implement scalable infrastructure.
Requirements
- Bachelor’s degree in Computer Engineering, Computer Science, or related field.
- 5+ years of experience in a similar role, preferably with experience in a high-traffic, high-availability environment.
- Proficiency in at least one programming language (Python, Ruby, Java, Go, etc.).
- Strong understanding of cloud infrastructure and related technologies (AWS, GCP, Azure, Kubernetes, Docker, etc.).
- Excellent troubleshooting and problem-solving skills.
- Experience with one or more automation and configuration management tools (Chef, Ansible, Puppet, Terraform, etc.).
- Familiarity with monitoring and alerting tools (Prometheus, Grafana, Nagios, etc.).
- Strong communication and interpersonal skills, enabling effective collaboration with cross-functional teams.
- Ability to navigate ambiguity, set clear expectations, and thrive in a fast-paced, dynamic environment.
- A strong grasp of computer science fundamentals, particularly as they apply to distributed systems and networks.