
Site Reliability Engineer II
LivePerson
full-time
Location Type: Remote
Location: Bulgaria
About the role
- Maintain and support existing products within the Echo ecosystem.
- Ensure high availability, performance, and reliability of platform services.
- Define, monitor, and improve SLOs, SLIs, and error budgets.
- Proactively identify system risks and implement reliability improvements.
- Participate in incident response, troubleshooting, and post-incident reviews.
- Deploy, manage, and optimize workloads on Google Kubernetes Engine (GKE).
- Manage cluster capacity, scaling strategies, and resource allocation.
- Optimize CPU, memory, and storage utilization to improve performance and reduce cost.
- Ensure cluster security, upgrades, and best practices are followed.
- Troubleshoot networking, service mesh (if applicable), ingress, and service-to-service communication issues.
- Implement and manage GitOps-based deployment workflows.
- Ensure infrastructure and application changes are version-controlled and automated.
- Work closely with developers to safely release code to production using CI/CD best practices.
- Support progressive delivery techniques (e.g., canary, blue/green deployments).
- Reduce deployment risk through automation and validation mechanisms.
- Implement and enhance observability practices across services.
- Build and maintain dashboards, alerts, and health metrics.
- Implement and manage OpenTelemetry (OTEL) for tracing and metrics collection.
- Ensure proactive alerting aligned with SLOs.
- Drive improvements in monitoring coverage and signal quality.
- Apply a strong understanding of Kubernetes networking, services, ingress, load balancing, DNS, and service communication.
- Diagnose latency, connectivity, and traffic routing issues.
- Understand how distributed services interact across the ecosystem.
Requirements
- 4–7 years of experience in SRE, DevOps, or Platform Engineering roles
- Strong hands-on experience managing production workloads on GKE
- Solid experience with GitOps practices (ArgoCD, Flux, or similar)
- Strong understanding of Kubernetes networking and cloud networking fundamentals
- Experience optimizing resource allocation and scaling in Kubernetes
- Experience implementing observability solutions using OpenTelemetry (OTEL)
- Experience defining and operating with SLOs and SLIs
- Hands-on experience with CI/CD pipelines and automated deployments
- Strong troubleshooting and incident management experience
Benefits
- Health: medical, dental, and vision
- Time away: vacation and holidays
- Development: generous tuition reimbursement and access to internal professional development resources
- Equal opportunity employer
Applicant Tracking System Keywords
Hard Skills & Tools
Google Kubernetes Engine (GKE), GitOps, OpenTelemetry (OTEL), CI/CD, SLOs, SLIs, networking, resource allocation, scaling, observability
Soft Skills
troubleshooting, incident management, proactive identification of risks, collaboration with developers, monitoring improvements