
Senior Consultant – SRE Architect
qode.world
full-time
Location Type: Hybrid
Location: Austin • Texas • United States
About the role
- Define and lead the enterprise observability strategy for end-to-end transaction traceability across distributed systems
- Architect scalable solutions leveraging tools such as Dynatrace, OpenTelemetry, ELK, Grafana, Datadog, Splunk, and Jaeger
- Establish standardized frameworks for logging, metrics, tracing, and telemetry collection
- Design and implement dependency mapping and service topology visualization across complex ecosystems
- Provide architectural guidance for monitoring latency, throughput, and error rates across critical transaction paths
- Lead root cause analysis using distributed tracing and telemetry data to resolve systemic performance issues
- Partner with application and database teams to optimize system performance and scalability
- Drive adoption of performance engineering best practices across teams
- Define and implement resiliency strategies for business-critical transaction flows
- Architect fault-tolerant systems, including failover, redundancy, and self-healing mechanisms
- Lead and design chaos engineering initiatives to validate system resilience
- Establish and govern Service Level Objectives (SLOs) and Service Level Indicators (SLIs) aligned to business outcomes
- Act as a trusted advisor to engineering teams, architects, and leadership on observability and SRE best practices
- Define and enforce standards, policies, and governance models for monitoring and tracing
- Lead cross-functional initiatives to drive adoption of observability frameworks
- Mentor engineers and SRE teams, fostering a culture of continuous improvement and operational excellence
- Drive measurable improvements, including:
  - 30% reduction in MTTD and MTTR within the first year
  - ≥70% root cause identification within 1 hour
  - ≥90% proactive issue detection via monitoring systems
- Develop executive-level reporting on system health, reliability trends, and performance metrics
- Build reusable frameworks, accelerators, and playbooks for incident management and observability adoption
- Establish comprehensive documentation for transaction flows, system dependencies, and observability architectures
- Develop and standardize incident response playbooks and runbooks
- Lead training and enablement initiatives to scale observability expertise across teams
Requirements
- 10+ years of experience in SRE, Observability, or related roles, with a strong focus on architecture and strategy
- Deep hands-on expertise with observability platforms such as Dynatrace, ELK, Datadog, Splunk, OpenTelemetry, and Jaeger
- Proven experience designing observability solutions in cloud environments (AWS, Azure, GCP)
- Strong understanding of microservices architecture, APIs, and distributed systems
- Proficiency in programming/scripting (e.g., Python, Go, Java) for automation and integration
- Demonstrated ability to lead cross-functional initiatives and influence technical direction
- Dynatrace Associate or Professional Certification
- Experience implementing OpenTelemetry standards at scale
- Strong background in chaos engineering and resiliency testing
- Familiarity with AIOps platforms and intelligent automation solutions
- Consulting experience or prior role as an architect / technical advisor
Applicant Tracking System Keywords
Hard Skills & Tools
observability, architecture, distributed systems, microservices, programming, automation, chaos engineering, resiliency testing, incident management, monitoring
Soft Skills
leadership, cross-functional collaboration, mentoring, influencing, continuous improvement, operational excellence, communication, strategic thinking, problem-solving, advisory
Certifications
Dynatrace Associate Certification, Dynatrace Professional Certification