Senior Site Reliability Engineer – GCP

Devsu

Senior Site Reliability Engineer at Devsu, focused on monitoring and observability across GCP and on-prem environments. Responsibilities include incident response, dashboard creation, platform reliability improvements, and technical support.

Posted 5/19/2026 • Full-time • Remote • 🇵🇪 Peru • Senior

Tech Stack

Tools & technologies

Cloud, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, ServiceNow

About the role

Key responsibilities & impact

Own and operate the monitoring and observability stack across on-prem and GCP environments
Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
Define, tune, and maintain alerts to ensure high signal-to-noise ratio
Establish observability standards and best practices across teams
Improve visibility into system health, performance, and reliability
Apply SRE principles to improve availability, performance, and resilience
Define and track SLIs, SLOs, and error budgets
Participate in on-call rotations and SEV incident response
Lead or contribute to incident investigations and root cause analysis (RCA)
Drive preventative actions to reduce repeat incidents
Support and monitor Kubernetes environments (GKE and on-prem clusters)
Monitor cluster health, capacity, and resource utilization
Troubleshoot platform-level issues impacting application reliability
Collaborate with Platform and Engineering teams on reliability improvements
Provide L2/L3 application support coverage during support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
Triage and troubleshoot application issues using existing runbooks and dashboards
Collaborate with Application Support and Engineering teams during incidents
Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

What you’ll need

Strong experience as a **Site Reliability Engineer or Reliability Engineer**
Deep hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting)
Solid experience with monitoring and observability systems
Production experience operating **Kubernetes** environments
Experience supporting systems in **GCP** and on-prem environments (mandatory)
Strong **Linux** systems and troubleshooting skills
Fluent **English** (written and spoken)
Ability to work in the **PST time zone**
Ability to participate in an **on-call rotation** that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.
Technology Stack:
Observability: Grafana, Prometheus, logging platforms
Containers: Kubernetes (GKE and on-prem)
Cloud: Google Cloud Platform (GCP)
Operations: Linux, networking, infrastructure monitoring
Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)
Nice to have:
Experience supporting application teams during SEV incidents
Knowledge of capacity planning and performance tuning
Scripting skills (Python, Bash, etc.)
Experience with hybrid infrastructure environments

Benefits

Comp & perks

A stable, long-term contract with opportunities for career growth
Private health insurance
A remote-friendly culture that promotes work-life balance
Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
A flexible Paid Time Off (PTO) policy as well as paid holiday days
Challenging, world-class software projects for clients in the US and LatAm
Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
