
Staff Engineer – SRE, Retail & Pharmacy
CVS Health
full-time
Location Type: Remote
Location: Massachusetts • Texas • United States
Salary
💰 $118,450 - $284,280 per year
About the role
- Implement and maintain comprehensive observability solutions, providing real-time insights into system performance and overall health
- Investigate and resolve incidents quickly during critical situations and perform root cause analysis
- Collaborate with cross-functional teams to build robust monitoring, alerting, and telemetry solutions
- Design and implement observability solutions tailored for edge computing environments
- Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs
- Build and optimize dashboards, visualizations, and alerting systems
- Implement distributed tracing and log aggregation systems
- Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind
- Drive proactive identification of issues in edge facilities through advanced observability tools
- Lead incident postmortems and implement observability-driven improvements
- Develop and maintain tools, scripts, and automation to enhance observability pipelines
- Evaluate and integrate industry-standard observability tools
- Optimize observability data storage, retention, and querying
- Mentor and guide junior SREs and engineers on observability best practices
- Partner with solution, engineering, and business teams to align observability efforts with business objectives
- Stay current with emerging observability trends, tools, and methodologies
- Contribute to the development of observability standards, runbooks, and documentation
- Drive cost optimization for observability infrastructure while maintaining high-quality monitoring
Requirements
- 8+ years of experience in SRE, DevOps, or related technology roles
- 5+ years of experience delivering software in large-scale environments, applying reliability and resilience concepts (multi-region, multi-cloud, containerization, etc.)
- 5+ years of experience with observability and monitoring tools such as Splunk, Dynatrace, Datadog, Prometheus, Grafana, etc.
- 3+ years of experience with programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
- 3+ years of experience with cloud technologies (AWS, Microsoft Azure, Google Cloud)
- 3+ years of experience with source control and continuous integration tools such as Git/Stash, Bitbucket, or Jenkins
- 2+ years of engineering team leadership or management experience
- Experience using customer feedback tools such as Quantum Metric, Medallia, and Adobe Analytics
- Deep understanding of microservices architecture and cloud-native technologies
- Experience in configuring, supporting, and managing Rancher, Kubernetes, and/or Docker
- Experience in Incident Management, Change Management, Infrastructure Support, and Problem Management concepts and processes
- Excellent interpersonal and communication skills, including the ability to engage technical and non-technical stakeholders
Benefits
- Affordable medical plan options
- 401(k) plan (including matching company contributions)
- Employee stock purchase plan
- No-cost programs for all colleagues including wellness screenings, tobacco cessation, and weight management programs
- Confidential counseling and financial coaching
- Paid time off
- Flexible work schedules
- Family leave
- Dependent care resources
- Colleague assistance programs
- Tuition assistance
- Retiree medical access
- Many other benefits depending on eligibility
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
observability solutions, root cause analysis, monitoring tools, alerting systems, distributed tracing, log aggregation, automation, programming languages, cloud technologies, microservices architecture
Soft Skills
interpersonal skills, communication skills, leadership, collaboration, mentoring, problem-solving, proactive identification, documentation, stakeholder engagement, incident postmortems