Manager – Site Reliability Engineering, SRE

Softcard (acquired by Google)

Posted 5/8/2026 • Full-time • Birmingham • Alabama • 🇺🇸 United States • Senior • Lead

Tech Stack

Tools & technologies

Cloud, Google Cloud Platform, Kubernetes, SDLC, Terraform

About the role

Key responsibilities & impact

Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
Define and track key SRE metrics such as service uptime, incident response time, and incident resolution time
Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends
Establish and monitor key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs) to drive system health and stability

Requirements

What you’ll need

Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role, or an equivalent combination of education and experience
Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents
Experience managing high-availability systems in 24/7 operational environments
Ability to collaborate cross-functionally and drive alignment across engineering, product, and security teams

Benefits

Comp & perks

Health insurance
Retirement plans
Paid time off
Flexible work arrangements
Professional development

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering (SRE), DevOps best practices, infrastructure-as-code (IaC), CI/CD pipelines, cloud infrastructure reliability, performance tuning, capacity planning, system design, Kubernetes, cloud-native architecture

Soft Skills

leadership, mentoring, collaboration, continuous improvement, problem-solving, communication, cross-functional alignment, ownership, operational excellence, incident management