Manager – Site Reliability Engineering, SRE
Softcard (acquired by Google). Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence.
Tech Stack
Tools & technologies: Cloud, Google Cloud Platform, Kubernetes, SDLC, Terraform
About the role
Key responsibilities & impact
- Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
- Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
- Define and track key SRE metrics such as service uptime, incident response and resolution times
- Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
- Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
- Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
- Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
- Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
- Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
- Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends
- Establish and monitor key performance indicators (KPIs), service level indicators (SLIs), and service level objectives (SLOs) to drive system health and stability
Requirements
What you’ll need
- Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role, or an equivalent combination
- Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
- Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
- Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
- In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
- Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
- Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
- Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
- Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents
- Experience managing high-availability systems in 24/7 operational environments
- Ability to collaborate cross-functionally and drive alignment across engineering, product, and security teams
Benefits
Comp & perks
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering (SRE), DevOps best practices, infrastructure-as-code (IaC), CI/CD pipelines, cloud infrastructure reliability, performance tuning, capacity planning, system design, Kubernetes, cloud-native architecture
Soft Skills
leadership, mentoring, collaboration, continuous improvement, problem-solving, communication, cross-functional alignment, ownership, operational excellence, incident management