Principal Software Engineer – AI Platform, Production Engineering, Reliability
CVS Health
Principal Software Engineer leading production excellence for AI Platform at CVS Health. Drive operational readiness and observability for AI services with high availability and performance standards.
Tech Stack
Tools & technologies
- AWS
- Azure
- Cloud
- Distributed Systems
- Google Cloud Platform
About the role
Key responsibilities & impact
- Own and evolve production operations strategy for AI/ML platforms and services
- Define SLOs, SLIs, and error budgets for AI systems
- Lead root cause analysis (RCA) and drive systemic improvements post-incident
- Establish operational readiness standards for launching new AI capabilities
- Build frameworks for on-call excellence, incident response, and escalation
- Design and implement end-to-end observability systems across AI workloads
- Implement model observability (drift detection, data skew, performance degradation)
- Build internal platforms and tooling for automated incident detection and response
- Mentor senior engineers and influence cross-team architectural decisions
Requirements
What you’ll need
- 10+ years in software engineering, production engineering, or SRE roles
- Deep experience operating large-scale distributed systems in production
- Proven track record building monitoring, observability, and alerting systems
- Strong expertise in incident management and production support models
- Experience working with cloud platforms (Azure, AWS, GCP)
Benefits
Comp & perks
- Medical, dental, and vision coverage
- Paid time off
- Retirement savings options
- Wellness programs
- Other resources based on eligibility
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- AI platforms
- ML platforms
- SLOs
- SLIs
- root cause analysis
- observability systems
- model observability
- incident detection
- automated response
- distributed systems
Soft Skills
- mentoring
- influencing
- leadership