Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
Lead incident response for cloud infrastructure issues, ensuring that all incidents are managed effectively.

Requirements

10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet).
7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes).
Proficiency in scripting or programming languages (e.g., Python, Go, Bash).
Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker).
Strong understanding of cloud networking, storage, and compute services, on-premises environments, and security best practices.
Strong knowledge of Linux administration and internals.
Effective communication skills, with the ability to translate technical concepts to non-technical stakeholders.

Benefits

Health insurance
401(k) matching
Flexible work hours
Paid time off
Remote work options
Employee assistance programs
Tuition assistance
Collaborative workspaces

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

cloud automation, configuration management, Site Reliability Engineering, Infrastructure Engineering, DevOps, cloud services, scripting, cloud-native architectures, Linux administration, cloud networking

Soft Skills

technical leadership, effective communication