VP, AI Reliability & Performance Architect
Synchrony
VP, AI Reliability & Performance Architect responsible for reliability and performance in an AWS-based AI ecosystem. Leads investigations and improvements in agent workflows and system reliability.
Tech Stack
Tools & technologies: AWS, Cloud, Python, Ray, Splunk, Terraform
About the role
Key responsibilities & impact
- Ensure the production-grade reliability, accuracy, and performance of our AWS-based agentic AI ecosystem
- Lead investigations of complex agent/AI workflow failures using logs, metrics, and traces
- Improve the quality and performance of Retrieval-Augmented Generation (RAG) and agent workflows
- Establish and oversee evaluation approaches for models, RAG, and agents
- Partner with InfoSec/AppSec to review architectures and ensure designs follow enterprise security patterns
- Work with Governance teams to implement and monitor guardrails and controls across the AI platform
- Drive 'Design for Reliability' patterns across both Platform and Agent Building teams
- Translate reliability risks, performance trends, and operational metrics into clear business language for senior leaders, risk, and product owners
- Coach DevLeads and architects on debugging agent behaviors, strengthening observability pipelines, improving orchestration, and hardening production deployments
Requirements
What you’ll need
- Bachelor's degree in Computer Science, Engineering, Information Systems, or related field (or equivalent experience)
- 10–14 years of IT experience, including meaningful roles in application development, platform engineering, SRE/operations, and/or architecture; or, in lieu of a degree, 12–16 years of such experience
- Strong experience operating and improving reliability of cloud-native systems (AWS preferred; comparable cloud experience acceptable)
- Experience supporting AI/ML systems is beneficial, but not mandatory if you demonstrate strong troubleshooting ability
- Strong ability to script/build tooling in Python (or similar language) for reliability automation, analysis, testing, and operational workflows
- Hands-on experience with observability practices and tools (CloudWatch/X-Ray/Splunk/New Relic or similar)
- Experience with Infrastructure-as-Code (Terraform preferred; similar tools acceptable)
- Working knowledge of identity and security patterns (OAuth2, SSO/federation, IAM roles/policies/SCP concepts)
- Proven ability to lead through influence, drive standards/guardrails, and align multiple agile teams in a matrixed environment
Benefits
Comp & perks
- Best-in-class employee benefits and programs that cater to work-life integration and overall well-being
- Career advancement and upskilling opportunities, focusing on Advancing Diverse Talent to take up leadership roles
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, Python, Infrastructure-as-Code, Terraform, AI/ML systems, Reliability automation, Observability practices, CloudWatch, X-Ray, Splunk
Soft Skills
leadership, influence, communication, coaching, collaboration, troubleshooting, alignment, organizational skills, problem-solving, agile methodologies