Deploy software for Cloud Prem and SaaS customers.
Respond to and diagnose system incidents in a timely and efficient manner, minimizing downtime and impact on users.
Collaborate with other engineers to establish root causes and implement effective resolutions.
Continuously improve incident response processes and documentation for future occurrences.
Proactively monitor and maintain the health and performance of our infrastructure and services.
Perform routine administrative tasks such as system configuration, user management, and data backups.
Identify and implement operational improvements to ensure ongoing system reliability and efficiency.
Develop and implement scripts and automated solutions to streamline operational tasks and reduce manual workload.
Participate in the on-call rotation to address critical incidents outside of regular business hours.
Ensure effective handoff between on-call engineers and document post-incident information for future reference.
Document support processes, and create, maintain, and execute runbooks for identified situations.
Provide tier 2/3 technical support to customers experiencing platform issues or requiring advanced troubleshooting.
Work directly with customer technical teams to resolve complex deployment, configuration, and integration challenges.
Conduct technical onboarding sessions and provide guidance on best practices for customer implementations.
Collaborate with customer success teams to ensure smooth customer experiences and rapid issue resolution.
Create and maintain customer-facing technical documentation, troubleshooting guides, and knowledge base articles.
Escalate customer feedback and feature requests to product and engineering teams.
Participate in customer calls and technical discussions to provide expert-level platform guidance.
Track and analyze customer support metrics to identify trends and areas for improvement.
Requirements
BS degree in Computer Science or a related field
3+ years of experience in Site Reliability Engineering
2+ years of experience working with cloud platforms and cloud automation tools, especially on AWS
Strong experience with Kubernetes, Linux, AWS networking (VPC), and Terraform
Experience with the GitOps model for deployment
Familiarity with distributed version control
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana)
Bazel and Helm experience a plus
Understanding of software configuration best practices
Ability to wear multiple hats in a fast-paced environment
Hands-on, “can do” attitude and a bias for action
Low ego and high intellectual curiosity
Comfortable working across time zones to support a global customer base
Excellent communication skills, with the ability to explain technical concepts to both technical and non-technical audiences
Strong customer service orientation with patience and empathy when working with frustrated customers
Benefits
Competitive salary
Equity
Health insurance
Paid time off
Flexible working hours