
Senior Site Reliability Engineer
MariaDB
full-time
Location Type: Remote
Location: Malaysia
About the role
- Design, implement, and evolve large-scale, cloud-native infrastructure supporting our global SaaS platform.
- Lead reliability and scalability initiatives that span multiple teams and services, driving automation and resilience through infrastructure-as-code and GitOps practices.
- Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.
- Collaborate with software and platform teams to integrate reliability principles, SLOs, and observability standards into every stage of the development lifecycle.
- Act as a key technical leader during major incidents by coordinating response efforts, conducting root cause analysis, and implementing long-term corrective actions.
- Contribute to continuous improvement by defining infrastructure patterns, refining CI/CD workflows, and mentoring other engineers in automation and reliability best practices.
Requirements
- At least 7 years of hands-on experience as an SRE, DevOps, or Infrastructure Engineer in production cloud environments.
- Strong expertise with Kubernetes operations and ecosystem tooling in production-scale clusters.
- Proven experience designing and maintaining multi-cloud infrastructure across Azure, AWS, or GCP.
- Advanced proficiency with Terraform and Terragrunt, capable of designing modular, reusable, and secure IaC components.
- Solid understanding of GitOps principles and deployment automation using ArgoCD or similar tools.
- Deep experience with Linux systems administration, performance tuning, and troubleshooting.
- Proficiency in one or more programming/scripting languages (Python, Bash, Go preferred).
- Strong understanding of observability concepts and experience working with monitoring and alerting tools such as Prometheus, Grafana, and Thanos.
- Experience participating in or leading on-call rotations, handling incident response, and conducting post-incident reviews.
Benefits
- 25 days paid annual leave (plus holidays)
- Competitive compensation package
- Flexibility and freedom
Applicant Tracking System Keywords
Hard skills
cloud-native infrastructure, infrastructure-as-code, GitOps, Kubernetes, Terraform, Terragrunt, Linux systems administration, Python, Bash, Go
Soft skills
leadership, collaboration, problem-solving, mentoring, incident response, root cause analysis, continuous improvement