Senior Site Reliability Engineer

Pave Bank

The Senior Site Reliability Engineer ensures high availability and performance of production systems at Pave Bank, collaborating with other teams on infrastructure reliability in a fintech environment.

Posted 6/22/2026 • Full-time • Remote • 🇲🇾 Malaysia • Senior

Tech Stack

Tools & technologies

Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Microservices, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Monitor, maintain, and improve the reliability, availability, and performance of production systems and services.
Build and maintain infrastructure as code (IaC), deployment pipelines, and automation to support continuous delivery, scalability, and disaster recovery.
Respond to incidents, perform root-cause analysis, and drive postmortems to ensure lessons learned are applied.
Implement and enforce operational best practices: observability, logging, metrics, alerting, capacity planning, failover strategies, and backups.
Collaborate with Engineering, Product, Compliance, and Operations teams to ensure infrastructure meets reliability, compliance, and security standards.
Support service scaling, database operations, cloud infrastructure (GCP preferred), networking, and microservices orchestration.
Document operational runbooks, on-call procedures, and system architecture to support maintenance, knowledge sharing, and compliance.

Requirements

What you’ll need

Strong programming or scripting skills (Go, Python, Bash, or similar) for automation, tooling, and operational tasks.
Hands-on experience with cloud infrastructure, ideally Google Cloud Platform (GCP).
Familiarity with containerization and orchestration (Docker, Kubernetes, or equivalent).
Experience with infrastructure-as-code tools (Terraform, Cloud Deployment Manager, or similar).
Experience with either FluxCD or ArgoCD for GitOps-based delivery.
Solid understanding of distributed systems, microservices architecture, and reliability patterns.
Experience setting up monitoring, logging, alerting, and observability (e.g., Prometheus, Grafana, ELK, distributed tracing).
Strong troubleshooting skills and ability to respond to incidents under pressure.
Knowledge of backup and disaster recovery strategies, database management, and secure operations.

Benefits

Comp & perks

Competitive salary and meaningful equity with room for growth.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Go, Python, Bash, cloud infrastructure, Google Cloud Platform, Docker, Kubernetes, Terraform, FluxCD, ArgoCD

Soft Skills

troubleshooting, incident response, collaboration, root-cause analysis, postmortem analysis, knowledge sharing, capacity planning, operational best practices