Software Engineering Manager – Site Reliability Center

PNC

Software Engineering Manager leading Site Reliability Engineering initiatives for PNC, focusing on operational excellence and team development.

Posted 6/26/2026 • Full-time • Pittsburgh • Alabama, Arizona, Colorado, Pennsylvania, Texas • 🇺🇸 United States • Mid-Level, Senior • 💰 $100,100 - $204,490 per year

Tech Stack

Tools & technologies

Cassandra, Cloud, ElasticSearch, ETL, Kafka, Linux, MongoDB, Oracle, Redis, SQL

About the role

Key responsibilities & impact

Manage SRE and related teams; lead, coach, and develop a team of SRE engineers; set clear goals, drive accountability, and foster a culture of ownership and excellence; partner with cross-functional stakeholders to align technology and business objectives; support talent development, performance management, and succession planning; encourage innovation, continuous learning, and DevOps/SRE best practices.
Lead incident management & remediation; manage and actively participate in end-to-end incident response for major (P1/P2) incidents; guide real-time triage, diagnostics, and troubleshooting across application, infrastructure, and network layers; ensure rapid execution of remediation actions and service restoration; provide clear, timely communication to stakeholders during incidents; oversee post-incident analysis, reporting, and documentation to drive improvements.
Provide technical leadership in production support; serve as an escalation point for complex production issues; guide troubleshooting across applications, infrastructure (Linux/Windows), databases (Oracle, SQL), and middleware and integrations; ensure efficient log, metric, and system analysis; oversee batch/ETL monitoring and recovery processes; foster strong collaboration across engineering, infrastructure, and vendor teams.
Drive problem management & root cause resolution; lead root cause analysis (RCA) efforts for major and recurring incidents; ensure ownership and resolution of problem records; drive permanent fixes and systemic improvements to eliminate repeat issues; identify trends and patterns to reduce risk and improve stability; partner with engineering teams to resolve code defects and system gaps; promote knowledge sharing via runbooks, knowledge articles, and error catalogs.
Oversee change management & release execution; ensure safe and compliant execution of production changes and releases; validate change readiness, testing, rollback strategies, and risk assessments; represent the team in CAB reviews, providing technical risk evaluation; oversee post-implementation reviews (CPIR) and ensure follow-through; drive improvements in change success rate and reductions in production defects.
Advance monitoring, alerting & observability; lead efforts to build and optimize monitoring, dashboards, and alerting frameworks; champion use of tools such as Dynatrace, BigPanda, Logscale, and enterprise platforms; improve signal-to-noise ratio through alert tuning; enable proactive issue detection before customer impact; strengthen event management and observability practices.
Champion resiliency, stability & availability; lead efforts to ensure high availability of critical systems; oversee disaster recovery, failover, and continuity testing; identify and eliminate single points of failure; drive improvements in MTTR, uptime, and service reliability.
Enable scalability & performance optimization; guide capacity planning and performance tuning strategies; ensure systems scale effectively under peak demand; partner with development teams for performance-driven design improvements; optimize system configurations to improve efficiency and throughput.
Lead a 24x7 production support model; manage team participation in a 24x7 on-call rotation; oversee engagement in incident bridges, war rooms, and escalations; support pod-based operating models aligned to key applications; ensure seamless handoffs and global support continuity.
Drive automation & operational efficiency; identify and prioritize opportunities to reduce manual effort through automation; implement automation across incident remediation, monitoring and alerting, and deployment and validation; promote standardized runbooks and automation frameworks; improve operational metrics and reduce toil.
Ensure governance, risk & compliance; maintain adherence to enterprise policies and regulatory standards; support audits, vulnerability remediation, and risk controls; ensure accurate documentation and operational procedures; champion security, access management, and data governance practices.

Requirements

What you’ll need

5+ years of related experience and 3+ years of management experience.
Strong experience in Site Reliability Engineering, Production Support, or DevOps.
Proven ability to lead teams in high-availability, enterprise environments.
Deep understanding of incident, problem, and change management frameworks.
Hands-on knowledge of monitoring tools, cloud/infrastructure platforms, and automation.
Experience improving system reliability, observability, and operational maturity.
Strong communication skills with the ability to lead during high-pressure situations.
Experience with infrastructure (Linux/Windows, OCP) and databases (Oracle, SQL, MongoDB, Cassandra), along with working knowledge of Elasticsearch, Redis, MQ, and Kafka, is a plus.

Benefits

Comp & perks

medical/prescription drug coverage (with a Health Savings Account feature)
dental and vision options
employee and spouse/child life insurance
short- and long-term disability protection
401(k) with PNC match
pension and stock purchase plans
dependent care reimbursement account
back-up child/elder care
adoption, surrogacy, and doula reimbursement
educational assistance, including select programs fully paid
a robust wellness program with financial incentives

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering, Production Support, DevOps, Incident Management, Problem Management, Change Management, Monitoring, Automation, Performance Optimization, Capacity Planning

Soft Skills

Leadership, Coaching, Communication, Collaboration, Accountability, Innovation, Continuous Learning, Crisis Management, Talent Development, Performance Management