Senior Site Reliability Engineer

Veza

full-time

Posted on: 10/8/2025

Location Type: Remote

Location: Remote • 🇮🇳 India

Job Level

Senior

Tech Stack

AWS, Cloud, Grafana, Kubernetes, Linux, Prometheus, Terraform

About the role

Deploy software for Cloud Prem and SaaS customers.
Respond to and diagnose system incidents in a timely and efficient manner, minimizing downtime and impact on users.
Collaborate with other engineers to establish root causes and implement effective resolutions.
Continuously improve incident response processes and documentation for future occurrences.
Proactively monitor and maintain the health and performance of our infrastructure and services.
Perform routine administrative tasks such as system configuration, user management, and data backups.
Identify and implement operational improvements to ensure ongoing system reliability and efficiency.
Develop and implement scripts and automated solutions to streamline operational tasks and reduce manual workload.
Participate in the on-call rotation to address critical incidents outside of regular business hours.
Ensure effective handoff between on-call engineers and document post-incident information for future reference.
Document support processes, and create, maintain, and execute runbooks for identified situations.
Provide tier 2/3 technical support to customers experiencing platform issues or requiring advanced troubleshooting.
Work directly with customer technical teams to resolve complex deployment, configuration, and integration challenges.
Conduct technical onboarding sessions and provide guidance on best practices for customer implementations.
Collaborate with customer success teams to ensure smooth customer experiences and rapid issue resolution.
Create and maintain customer-facing technical documentation, troubleshooting guides, and knowledge base articles.
Escalate customer feedback and feature requests to product and engineering teams.
Participate in customer calls and technical discussions to provide expert-level platform guidance.
Track and analyze customer support metrics to identify trends and areas for improvement.

Requirements

BS degree in Computer Science or related field
3+ years of experience in Site Reliability Engineering
2+ years of experience working with cloud platforms and cloud automation tools, especially on AWS
Strong experience with Kubernetes, Linux, AWS networking (VPC), and Terraform
Experience with the GitOps model for deployment
Familiarity with distributed version control
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana)
Bazel and Helm experience a plus
Understanding of software configuration best practices
Ability to wear multiple hats in a fast-paced environment
Hands-on, “can do” attitude and a bias for action
Low ego and high intellectual curiosity
Comfortable working across time zones to support a global customer base
Excellent communication skills, with the ability to explain technical concepts to both technical and non-technical audiences
Strong customer service orientation, with patience and empathy when working with frustrated customers

Benefits

Equity
Competitive benefits package

Applicant Tracking System Keywords

Hard skills

Site Reliability Engineering, AWS, Kubernetes, Linux, AWS networking, Terraform, GitOps, monitoring tools, alerting tools, software configuration best practices

Soft skills

communication skills, customer service orientation, patience, empathy, ability to work across time zones, adaptability, problem-solving, collaboration, technical onboarding, documentation skills