Veza

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇮🇳 India

Job Level

Senior

Tech Stack

AWS, Cloud, Grafana, Kubernetes, Linux, Prometheus, Terraform

About the role

  • Deploy software for Cloud Prem and SaaS customers.
  • Respond to and diagnose system incidents in a timely and efficient manner, minimizing downtime and impact on users.
  • Collaborate with other engineers to establish root causes and implement effective resolutions.
  • Continuously improve incident response processes and documentation for future occurrences.
  • Proactively monitor and maintain the health and performance of our infrastructure and services.
  • Perform routine administrative tasks such as system configuration, user management, and data backups.
  • Identify and implement operational improvements to ensure ongoing system reliability and efficiency.
  • Develop and implement scripts and automated solutions to streamline operational tasks and reduce manual workload.
  • Participate in the on-call rotation to address critical incidents outside of regular business hours.
  • Ensure effective handoff between on-call engineers and document post-incident information for future reference.
  • Document support processes, and create, maintain, and execute runbooks for identified situations.
  • Provide tier 2/3 technical support to customers experiencing platform issues or requiring advanced troubleshooting.
  • Work directly with customer technical teams to resolve complex deployment, configuration, and integration challenges.
  • Conduct technical onboarding sessions and provide guidance on best practices for customer implementations.
  • Collaborate with customer success teams to ensure smooth customer experiences and rapid issue resolution.
  • Create and maintain customer-facing technical documentation, troubleshooting guides, and knowledge base articles.
  • Escalate customer feedback and feature requests to product and engineering teams.
  • Participate in customer calls and technical discussions to provide expert-level platform guidance.
  • Track and analyze customer support metrics to identify trends and areas for improvement.

Requirements

  • BS degree in Computer Science or related field
  • 3+ years of experience in Site Reliability Engineering
  • 2+ years of experience working with cloud platforms and cloud automation tools, especially on AWS
  • Strong experience with Kubernetes, Linux, AWS networking (VPC), and Terraform
  • Experience with the GitOps model for deployment
  • Familiarity with distributed version control
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana)
  • Bazel and Helm experience a plus
  • Understanding of software configuration best practices
  • Ability to wear multiple hats in a fast-paced environment
  • Hands-on, “can do” attitude and a bias for action
  • Low ego and high intellectual curiosity
  • Comfortable working across time zones to support a global customer base
  • Excellent communication skills, with the ability to explain technical concepts to both technical and non-technical audiences
  • Strong customer service orientation with patience and empathy when working with frustrated customers

Benefits

  • Equity
  • Competitive benefits package
