TWO95 International, Inc

DevOps Engineer / Site Reliability Engineer

contract

Location Type: Remote

Location: United States

About the role

  • **Job Title: Lead SRE (Site Reliability Engineer)**
  • **Location: Remote Work**
  • **Type: 6+ Month Contract-to-Hire**
  • **Rate: Open ($/hr)**
  • Please forward your updated resume to **deivy.malli@two95intl.com**, and include your rate requirement, your contact details, and a suitable time when we can reach you.
  • **Responsibilities**
  • Own uptime, SLAs, and overall reliability of the cloud infrastructure and kiosks platform.
  • Lead incident response and root-cause analysis, and drive actionable postmortems.
  • Automate infrastructure, deployments, and operational tasks using modern IaC and scripting, in collaboration with the Platform Engineering team.
  • Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
  • Manage, operate and recommend improvement of mo
  • Execute and continuously improve disaster recovery and business continuity plans.
  • Partner with platform engineering, QA, and development teams to ensure operational readiness.
  • Establish and maintain runbooks, operational standards, and reliability best practices.
  • Provide leadership, mentorship, and clear communication during both normal operations and incidents.
  • Optimize cloud and Kubernetes environments for reliability, performance, and scalability.

Requirements

  • **Qualifications**
  • 8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity.
  • Strong experience supporting production environments with strict SLAs and high uptime requirements.
  • Deep knowledge of Kubernetes, containers, and cloud-native infrastructure.
  • Proficiency in automation and scripting using Bash, Python, or Go.
  • Hands-on experience with CI/CD pipelines and release engineering in modern environments.
  • Expert-level familiarity with IaC tools (Terraform preferred).
  • Strong understanding of monitoring, alerting, logging, and observability tooling.
  • Experience implementing and managing GitOps workflows (ArgoCD or similar).
  • Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders.
  • Solid understanding of disaster recovery planning, resilience practices, and system hardening.