
DevOps Engineer / Site Reliability Engineer
TWO95 International, Inc
contract
Location Type: Remote
Location: United States
About the role
- **Job Title: Lead SRE (Site Reliability Engineer)**
- **Location: Remote**
- **Type: 6+ month contract-to-hire**
- **Rate: Open ($/hr)**
- Please forward your updated resume to **deivy.malli@two95intl.com**, and include your rate requirement, your contact details, and a suitable time to reach you.
- **Responsibilities**
- Own uptime, SLAs, and overall reliability of the cloud infrastructure and kiosk platform.
- Lead incident response and root-cause analysis, and drive actionable postmortems.
- Automate infrastructure, deployments, and operational tasks using modern IaC and scripting, in collaboration with the Platform Engineering team.
- Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
- Manage, operate and recommend improvement of mo
- Execute and continuously improve disaster recovery and business continuity plans.
- Partner with platform engineering, QA, and development teams to ensure operational readiness.
- Establish and maintain runbooks, operational standards, and reliability best practices.
- Provide leadership, mentorship, and clear communication during both normal operations and incidents.
- Optimize cloud and Kubernetes environments for reliability, performance, and scalability.
Requirements
- **Qualifications**
- 8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity.
- Strong experience supporting production environments with strict SLAs and high uptime requirements.
- Deep knowledge of Kubernetes, containers, and cloud-native infrastructure.
- Proficiency in automation and scripting using Bash, Python, or Go.
- Hands-on experience with CI/CD pipelines and release engineering in modern environments.
- Expert-level familiarity with IaC tools (Terraform preferred).
- Strong understanding of monitoring, alerting, logging, and observability tooling.
- Experience implementing and managing GitOps workflows (ArgoCD or similar).
- Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders.
- Solid understanding of disaster recovery planning, resilience practices, and system hardening.