Senior Site Reliability Engineer – GCP

Devsu

Senior Site Reliability Engineer at Devsu, responsible for enhancing monitoring and observability across GCP and on-prem environments. Responsibilities include incident response, dashboard creation, platform reliability improvements, and technical support.

Posted 5/19/2026 • Full-time • Remote • 🇵🇪 Peru • Senior

Tech Stack

Tools & technologies
Cloud, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, ServiceNow

About the role

Key responsibilities & impact
  • Own and operate the monitoring and observability stack across on-prem and GCP environments
  • Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  • Define, tune, and maintain alerts to ensure high signal-to-noise ratio
  • Establish observability standards and best practices across teams
  • Improve visibility into system health, performance, and reliability
  • Apply SRE principles to improve availability, performance, and resilience
  • Define and track SLIs, SLOs, and error budgets
  • Participate in on-call rotations and SEV incident response
  • Lead or contribute to incident investigations and root cause analysis (RCA)
  • Drive preventative actions to reduce repeat incidents
  • Support and monitor Kubernetes environments (GKE and on-prem clusters)
  • Monitor cluster health, capacity, and resource utilization
  • Troubleshoot platform-level issues impacting application reliability
  • Collaborate with Platform and Engineering teams on reliability improvements
  • Provide L2/L3 application support coverage during:
      • Support team resource shortages
      • High-severity incidents (SEVs)
      • Peak support periods or escalations
  • Triage and troubleshoot application issues using existing runbooks and dashboards
  • Collaborate with Application Support and Engineering teams during incidents
  • Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

What you’ll need
  • Strong experience as a **Site Reliability Engineer or Reliability Engineer**
  • Deep hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting)
  • Solid experience with monitoring and observability systems
  • Production experience operating **Kubernetes** environments
  • Experience supporting systems in **GCP** and on-prem environments (mandatory)
  • Strong **Linux** systems and troubleshooting skills
  • Fluent **English** (written and spoken)
  • Ability to work in the **PST time zone**
  • Ability to participate in an **on-call rotation** that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.
  • Technology Stack:
      • Observability: Grafana, Prometheus, logging platforms
      • Containers: Kubernetes (GKE and on-prem)
      • Cloud: Google Cloud Platform (GCP)
      • Operations: Linux, networking, infrastructure monitoring
      • Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)
  • Nice to have:
      • Experience supporting application teams during SEV incidents
      • Knowledge of capacity planning and performance tuning
      • Scripting skills (Python, Bash, etc.)
      • Experience with hybrid infrastructure environments

Benefits

Comp & perks
  • A stable, long-term contract with opportunities for career growth
  • Private health insurance
  • A remote-friendly culture that promotes work-life balance
  • Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
  • Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
  • A flexible Paid Time Off (PTO) policy, plus paid holidays
  • Challenging, world-class software projects for clients in the US and LatAm
  • Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineer, Reliability Engineer, Grafana, Kubernetes, GCP, Linux, monitoring systems, observability systems, scripting (Python, Bash), capacity planning
Soft Skills
communication, collaboration, troubleshooting, incident response, root cause analysis, preventative actions, documentation, on-call rotation, time management, problem-solving