Tech Stack
AWS, Azure, Cloud, Docker, ERP, Kubernetes, Terraform
About the role
- Own and manage cloud infrastructure and code deployment processes across all environments
- Partner with DevOps to consume CI pipelines and ensure seamless, reliable CD execution
- Oversee infrastructure provisioning and environment readiness using IaC and automation tools
- Ensure system reliability and compliance through OS patching and server upgrades
- Define and manage server and storage backup strategies to meet customer RPO/RTO targets
- Lead configuration and optimization of monitoring tools (New Relic, Uptime Robot, PagerDuty)
- Drive creation of dashboards, alerts, and automated reports for system health and performance
- Ensure visibility into system and application behavior across all customer environments
- Build and mentor a high-performing SRE team focused on ownership, accountability, and continuous improvement
- Collaborate with Engineering, DevOps, DBA, and Support teams to align reliability goals with product and customer needs
- Develop and enforce best practices for incident response, postmortems, and change management
- Serve as an escalation point for complex technical issues and customer concerns related to cloud infrastructure and services
- Monitor and report on key reliability metrics including system uptime, application performance, alert volumes, and severity-1 incidents
- Identify and eliminate toil through automation and process refinement
- Champion a culture of resilience, transparency, and proactive problem-solving
Requirements
- 6+ years in SRE, DevOps, or infrastructure engineering roles supporting SaaS environments
- 2+ years in a leadership capacity
- Strong experience with cloud platforms (AWS and Azure)
- Experience with containers (Kubernetes, Docker)
- Experience with IaC tools (Terraform, CloudFormation)
- Deep understanding of CI/CD pipelines and deployment orchestration
- Hands-on experience with observability platforms and telemetry pipelines (e.g., New Relic, Uptime Robot, PagerDuty)
- Excellent communication and stakeholder management skills
- (Nice to have) Experience supporting single-tenant SaaS platforms
- (Nice to have) Familiarity with ITIL or ticket-based deployment workflows
- (Nice to have) Background in performance tuning and capacity planning