MariaDB

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Malaysia

About the role

  • Design, implement, and evolve large-scale, cloud-native infrastructure supporting our global SaaS platform.
  • Lead reliability and scalability initiatives that span multiple teams and services, driving automation and resilience through infrastructure-as-code and GitOps practices.
  • Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.
  • Collaborate with software and platform teams to integrate reliability principles, SLOs, and observability standards into every stage of the development lifecycle.
  • Act as a key technical leader during major incidents: coordinating response efforts, conducting root cause analysis, and implementing long-term corrective actions.
  • Contribute to continuous improvement by defining infrastructure patterns, refining CI/CD workflows, and mentoring other engineers in automation and reliability best practices.

Requirements

  • At least 7 years of hands-on experience as an SRE, DevOps, or Infrastructure Engineer in production cloud environments.
  • Strong expertise with Kubernetes operations and ecosystem tooling in production-scale clusters.
  • Proven experience designing and maintaining multi-cloud infrastructure spanning Azure, AWS, and/or GCP.
  • Advanced proficiency with Terraform and Terragrunt, capable of designing modular, reusable, and secure IaC components.
  • Solid understanding of GitOps principles and deployment automation using ArgoCD or similar tools.
  • Deep experience with Linux systems administration, performance tuning, and troubleshooting.
  • Proficiency in one or more programming/scripting languages (Python, Bash, or Go preferred).
  • Strong understanding of observability concepts and experience working with monitoring and alerting tools such as Prometheus, Grafana, and Thanos.
  • Experience participating in or leading on-call rotations, handling incident response, and conducting post-incident reviews.

Benefits
  • 25 days paid annual leave (plus holidays)
  • Competitive compensation package
  • Flexibility and freedom

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
cloud-native infrastructure, infrastructure-as-code, GitOps, Kubernetes, Terraform, Terragrunt, Linux systems administration, Python, Bash, Go
Soft skills
leadership, collaboration, problem-solving, mentoring, incident response, root cause analysis, continuous improvement