Implement and maintain comprehensive observability solutions, providing real-time insights into system performance and overall health
Investigate and resolve incidents quickly during critical situations and perform root cause analysis
Collaborate with cross-functional teams to build robust monitoring, alerting, and telemetry solutions
Design and implement observability solutions tailored for edge computing environments
Define and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and business KPIs
Build and optimize dashboards, visualizations, and alerting systems
Implement distributed tracing and log aggregation systems
Collaborate with engineering teams to ensure applications and infrastructure at edge locations are designed with observability in mind
Drive proactive identification of issues in edge facilities through advanced observability tools
Lead incident postmortems and implement observability-driven improvements
Develop and maintain tools, scripts, and automation to enhance observability pipelines
Evaluate and integrate industry-standard observability tools
Optimize observability data storage, retention, and querying
Mentor and guide junior SREs and engineers on observability best practices
Partner with solution, engineering, and business teams to align observability efforts with business objectives
Stay current with emerging observability trends, tools, and methodologies
Contribute to the development of observability standards, runbooks, and documentation
Drive cost optimization for observability infrastructure while maintaining high-quality monitoring

Requirements

8+ years of experience in SRE, DevOps, or related technology roles
5+ years of experience delivering software in large-scale environments, applying reliability and resilience concepts (multi-region, multi-cloud, containerization, etc.)
5+ years of experience with observability and monitoring tools such as Splunk, Dynatrace, Datadog, Prometheus, and Grafana
3+ years of experience with programming/scripting languages (e.g., Python, Java) for automation and tooling in distributed environments
3+ years of experience with cloud technologies (AWS, Microsoft Azure, Google Cloud)
3+ years of experience with source control and continuous integration tools like Git/Stash, Bitbucket, or Jenkins
2+ years of engineering team leadership or management experience
Experience using customer feedback tools such as Quantum Metric, Medallia, and Adobe Analytics
Deep understanding of microservices architecture and cloud-native technologies
Experience in configuring, supporting, and managing Rancher, Kubernetes, and/or Docker
Experience in Incident Management, Change Management, Infrastructure Support, and Problem Management concepts and processes
Excellent interpersonal and communication skills, including the ability to engage technical and non-technical stakeholders

Benefits

Affordable medical plan options
401(k) plan (including matching company contributions)
Employee stock purchase plan
No-cost programs for all colleagues including wellness screenings, tobacco cessation, and weight management programs
Confidential counseling and financial coaching
Paid time off
Flexible work schedules
Family leave
Dependent care resources
Colleague assistance programs
Tuition assistance
Retiree medical access
Many other benefits depending on eligibility

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

observability solutions, root cause analysis, monitoring tools, alerting systems, distributed tracing, log aggregation, automation, programming languages, cloud technologies, microservices architecture

Soft Skills

interpersonal skills, communication skills, leadership, collaboration, mentoring, problem-solving, proactive identification, documentation, stakeholder engagement, incident postmortems