Tech Stack
Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Grafana, Java, Kubernetes, Linux, NoSQL, Prometheus, Puppet, Python, Ruby, Splunk, SQL, Terraform, Unix
About the role
- On-call Production Support: Provide production support in shifts according to the team's on-call roster.
- When not on call for production support, work on tickets raised by customers and by the internal engineering and implementation teams. For example, a client may ask for a data correction on the database server that cannot be made through the web interface.
- Work on the SRE team's backlog items.
- Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
- Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
- Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
- System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
- Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
- Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
- Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
- Security: Implement and adhere to security best practices to protect our systems and data.
Requirements
- Technical Skills: Good understanding of Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Experience with scripting languages such as Python, Bash, or Ruby.
- Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
- Experience: 5+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
- Technical Skills: Expertise in Linux/Unix systems, networking, and cloud services (AWS, Azure, or Google Cloud). Proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
- Tools and Technologies: Advanced knowledge of monitoring and logging tools (Prometheus, Grafana, Splunk), configuration management (Ansible, Chef, Puppet), and CI/CD pipelines.
- Problem-Solving: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
- Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
- Leadership: Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
- 5+ years of experience in an SRE, DevOps, or Software Engineering role.
- Certifications: Relevant certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or similar.
- Knowledge: In-depth understanding of containerization (Docker, Kubernetes) and infrastructure as code (Terraform, CloudFormation).
- Experience: Hands-on work with database management (SQL, NoSQL), load balancing, and distributed systems.