Serve as a lead technical resource in triaging issues, maintaining observability and support tooling, and partnering with infrastructure and cloud teams to ensure continuity of service
Lead triage and resolution of incidents affecting GenAI platform availability or performance
Maintain and enhance observability tooling to ensure system health and performance
Collaborate with internal infrastructure teams (e.g., Google Cloud Platform support) to resolve platform-level issues
Own diagnostics and root cause analysis for recurring platform incidents
Support and maintain internal GenAI-facing platforms, including Agent Space, ensuring uptime and reliability
Contribute to operational runbooks, automation scripts, and service documentation
Mentor junior engineers and contribute to a culture of operational excellence
Requirements
5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, or education
5+ years of experience in platform operations, SRE, or infrastructure engineering
3+ years of experience with observability tools (e.g., Prometheus, Grafana, Splunk)
3+ years of experience in incident management and root cause analysis in production environments
2+ years of experience supporting internal platforms or services used by engineering or ML teams
1+ year of experience collaborating across geographically distributed teams
Desired: 2+ years of experience working with cloud infrastructure platforms, preferably Google Cloud Platform
Desired: Experience with infrastructure-as-code tools (e.g., Terraform, Ansible)
Desired: Experience with containerized applications
Desired: Experience supporting GenAI tools
Benefits
Position offers a hybrid work schedule
Relocation assistance is not available for this position
This role is not eligible for visa sponsorship
Wells Fargo is an equal opportunity employer
Accommodation for applicants with disabilities is available upon request
Drug-free workplace
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Software Engineering, platform operations, SRE, infrastructure engineering, observability tools, incident management, root cause analysis, infrastructure-as-code, containerized applications, GenAI tools