Senior Manager, Site Reliability Engineering

Catalyst Brands India

Senior Manager overseeing Site Reliability Engineering teams for Catalyst Brands. Managing operations and driving team productivity for eCommerce and CRM platforms.

Posted 4/30/2026 • Full-time • Dallas, Texas, 🇺🇸 United States • Senior • 💰 $103,500 - $172,500 per year

Tech Stack

Tools & technologies

Ansible, Apache, AWS, Azure, Chef, Cloud, DNS, ITSM, J2EE, Java, Jenkins, Kafka, Python, Ruby, ServiceNow, Splunk, TCP/IP, Terraform

About the role

Key responsibilities & impact

Provide both technical and people leadership to Site Reliability Engineering (SRE) teams through regular one-on-one meetings, team syncs, and performance reviews.
Manage project execution by organizing cross-functional teams, assigning responsibilities, and tracking progress against defined schedules and milestones.
Assist in budgeting, workforce planning, hiring, and third-party contract negotiations to support team growth and operational goals.
Drive continuous improvements in platform reliability, stability, and performance by overseeing the deployment of fully automated telemetry, observability, and AI-driven monitoring solutions.
Lead the development and enhancement of intelligent alerting and automated incident response systems to improve service restoration speed and issue detection.
Collaborate with administrators and platform engineers on implementation decisions to ensure highly reliable infrastructure, systems, and integrations.
Document all changes in accordance with change control policies and documentation standards; identify risks and recommend corrective actions when necessary.
Provide advanced Incident Management and Problem Management support by analyzing telemetry data and system logs to identify, remediate, and prevent reliability issues.
Participate in on-call escalation support rotations in alignment with the 24/7/365 support model.
Act as the Escalation Manager/Critical Incident Manager during major incidents, guiding teams through structured and effective service recovery.
Communicate timely updates and incident reports to senior leadership during and after critical events.
Lead conversations and provide business and engineering support for both internal stakeholders and external customers.

Requirements

What you’ll need

10+ years of experience in global organizations, with a proven ability to communicate effectively across all levels—from executives to individual contributors.
5+ years of hands-on Site Reliability Engineering (SRE) experience, including platform automation, telemetry, observability, and self-healing systems.
Demonstrated leadership and collaboration in high-availability, mission-critical digital environments.
Strong support knowledge and understanding of the retail eCommerce flow across web and mobile technologies.
Experience working with software engineers across scrum teams and performance engineering to ensure systems meet reliability and performance standards.
Hands-on experience debugging and optimizing code and automation.
Ability to identify opportunities to adopt innovative technologies and drive continuous improvement: automation, shift-left, and self-healing.
Extensive experience supporting and administering digital retail and eCommerce platforms with one of the Cloud providers (AWS/Azure/Google Cloud).
Demonstrated experience in application design, software development, testing, and production support of Java/J2EE-based eCommerce applications.
Practical experience monitoring and maintaining streaming platform technologies such as Apache Kafka.
Deep understanding of cloud-native architectures and platform operations.
Proficient with modern monitoring, logging, and telemetry tools, including New Relic, Splunk, ELK, Datadog, Dynatrace, Catchpoint, and AWS CloudWatch.
Hands-on experience designing and implementing automated health checks, observability pipelines, and self-healing solutions.
Strong experience with automation tools and frameworks, such as: Jenkins, Chef, Ansible, Terraform.
Expertise in scripting languages used for platform automation and diagnostics: PowerShell, Python, Ruby, AWK, SED, etc.
Advanced experience with public cloud platforms: Microsoft Azure and Amazon Web Services (AWS).
Solid understanding of networking fundamentals: TCP/IP, DNS, DHCP, WINS.
Advanced experience with Content Delivery Networks (CDNs) such as Akamai and Cloudflare.
Experience using ITSM and collaboration platforms: Jira, BMC Remedy, ServiceNow.
Strong understanding of IT operations frameworks (e.g., ITIL, MOF).
Bachelor’s degree in computer science or related technical field.

Benefits

Comp & perks

💵 $103,500 - $172,500 per year • Full Time • Dallas (Hybrid)

Catalyst Brands India (501 - 1000 employees; Fashion, Retail) is a consumer-focused holding company that brings together the heritage and operations of six apparel and lifestyle brands under a single organization. The company positions these brands to reach and serve diverse working families across America, emphasizing accessible fashion, casual lifestyle, and outdoor-inspired apparel while aiming to scale distribution, influence, and brand impact.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering (SRE), platform automation, telemetry, observability, self-healing systems, Java-J2EE, streaming platform technologies, scripting languages, automation tools, cloud-native architectures

Soft Skills

leadership, collaboration, communication, problem management, incident management, team management, performance reviews, budgeting, cross-functional team organization, stakeholder engagement

Certifications

Bachelor’s degree in computer science