
Site Reliability Engineer
SupplyHouse.com
Full-time
Location Type: Remote
Location: Remote • 🇮🇳 India
Salary
💰 $29,000 - $36,000 per year
Job Level
Mid-Level, Senior
Tech Stack
Ansible, Cloud, Docker, Go, Google Cloud Platform, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Python, SQL, Terraform, Unix
About the role
- Ensure the scalability, reliability, and performance of our infrastructure and applications with a focus on automation, monitoring, and incident response
- Design, build, and maintain scalable, reliable systems on GCP (Compute Engine, GKE, Cloud Storage, Cloud SQL)
- Develop automation for infrastructure provisioning using Terraform, Ansible, or Deployment Manager
- Build and maintain observability platforms (monitoring, logging, tracing) using tools such as Stackdriver (Cloud Monitoring), Prometheus, or Grafana
- Manage incident response, conduct postmortems, and implement improvements to reduce recurrence
- Partner with DevOps and engineering teams to enhance CI/CD pipelines for resilient deployments
- Define and monitor SLAs, SLOs, and SLIs to ensure application availability and performance
- Implement disaster recovery (DR) and backup strategies across cloud services
- Continuously optimize performance, capacity, and cost-efficiency of GCP resources
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field
- 3+ years of hands-on experience as a Site Reliability Engineer, DevOps Engineer, Systems Engineer, or Cloud Infrastructure Engineer
- Proven track record managing production-grade systems on Google Cloud Platform (GCP) or other cloud providers
- Strong understanding of Linux/Unix system administration, networking, and troubleshooting
- Experience implementing Infrastructure as Code (IaC) using tools like Terraform, Ansible, or Deployment Manager
- Familiarity with containerization and orchestration technologies such as Docker and Kubernetes (GKE)
- Experience with monitoring and observability tools (Google Cloud Operations Suite, Prometheus, Grafana, Datadog, ELK)
- Experience defining and monitoring SLAs, SLOs, and SLIs to ensure application uptime and performance
- Proven ability to handle incident response, conduct postmortems, and drive root cause analysis
- Proficiency in at least one scripting language (Python, Bash, or Go) for automation and tooling
- Hands-on experience building or managing CI/CD pipelines (Jenkins, GitLab CI, Cloud Build)
- Strong background in configuration management and release automation
- Knowledge of IAM (Identity and Access Management), network security, and cloud compliance controls
- Familiarity with disaster recovery (DR), backups, and high-availability design
Benefits
- Comprehensive and affordable medical, dental, vision, and life insurance options
- Competitive Provident Fund contributions
- Paid time off and holidays
- Mental health support and wellbeing program
- Company-provided equipment and a one-time $250 USD work-from-home stipend
- $750 USD annual professional development budget
- Company rewards and recognition program
- And more!
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Google Cloud Platform (GCP), Terraform, Ansible, Deployment Manager, Linux/Unix system administration, Docker, Kubernetes, Python, Bash, Go
Soft skills
incident response, root cause analysis, collaboration, problem-solving, communication
Certifications
Bachelor's degree in Computer Science, Bachelor's degree in Engineering