Senior Site Reliability Engineer

MRSOOL | مرسول

full-time

Posted on: 9/18/2025

Origin: 🇪🇬 Egypt

Senior

Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Puppet, Python, Ruby, Terraform

About the role

Develop and maintain monitoring and alerting systems to proactively identify and address issues.
Troubleshoot and escalate production incidents to minimize downtime and improve system reliability.
Continuously improve our infrastructure and processes to optimize scalability and efficiency.
Participate in and take ownership of on-call rotations as needed to ensure 24/7 support for our application.
Perform routine maintenance and upgrades as needed to keep our systems up to date.
Contribute to ongoing efforts to improve our security posture and compliance with industry standards.
Communicate complex technical concepts clearly and concisely to both technical and non-technical stakeholders so that they can make well-informed decisions.
Mentor and coach junior engineers, fostering their professional growth and enabling them to deliver high-quality work.
Stay up to date with the latest advancements and trends in site reliability engineering, and share knowledge and insights with the team.
Identify opportunities for organizational enhancements and propose alternatives to optimize team structures and execution.
Collaborate with development teams to design and implement automated deployment and testing pipelines.
Collaborate with development teams to design and implement scalable infrastructure.

Requirements

Bachelor’s degree in Computer Engineering, Computer Science, or a related field.
5+ years of experience in a similar role, preferably with experience in a high-traffic, high-availability environment.
Proficiency in at least one programming language (Python, Ruby, Java, Go, etc.).
Strong understanding of cloud infrastructure and related technologies (AWS, GCP, Azure, Kubernetes, Docker, etc.).
Excellent troubleshooting and problem-solving skills.
Experience with one or more automation and configuration management tools (Chef, Ansible, Puppet, Terraform, etc.).
Familiarity with monitoring and alerting tools (Prometheus, Grafana, Nagios, etc.).
Strong communication and interpersonal skills, enabling effective collaboration with cross-functional teams.
Ability to navigate ambiguity, set clear expectations, and thrive in a fast-paced, dynamic environment.
Strong grasp of computer science fundamentals, particularly as they apply to distributed systems and networks.