Tech Stack
AWS, Cloud, Distributed Systems, Go, Grafana, Java, Kafka, Kubernetes, Prometheus, Python
About the role
- Lead the end-to-end architecture and delivery of key observability platform components, focusing on reliability, scalability, and usability.
- Drive consistency and quality across all observability signals (logs, metrics, traces, and continuous profiling), and build intuitive workflows for engineers.
- Serve as a technical advisor and mentor across the platform org, guiding design decisions and aligning cross-team efforts with long-term architectural goals.
- Go deep in one or more problem areas (e.g., high-cardinality telemetry, distributed tracing correlation, compute cost insights), while ensuring the platform scales horizontally.
- Collaborate with product teams, SREs, and developer experience groups to understand telemetry needs and integrate observability into core engineering workflows.
- Design and build developer-friendly tooling and APIs to support incident response, performance analysis, and platform debugging at scale.
- Leverage and optionally contribute to open-source standards like OpenTelemetry to ensure interoperability and extensibility.
- Champion a pragmatic approach to observability—balancing performance, cost, and user value across diverse engineering teams.
Requirements
- Proven expertise in building and scaling observability systems (e.g., logging platforms, metrics pipelines, tracing infrastructure, or profiling tools).
- Ability to lead technical execution for major components of Twilio’s observability overhaul, including the shift to centralized S3-based data lakes, OpenTelemetry instrumentation, and ClickHouse-backed query engines.
- Deep proficiency in at least one modern programming language (e.g., Go, Python, Java).
- Familiarity with high-cardinality data challenges and telemetry correlation techniques.
- Experience designing high-scale telemetry systems (e.g., Prometheus, ClickHouse, OpenTelemetry, Kafka, or equivalent).
- Solid understanding of distributed systems and the challenges of observability in complex, microservice-based environments.
- Experience with AWS, Kubernetes, and infrastructure-as-code tools.
- Ability to provide architectural guidance and thought leadership across teams, helping to establish clear telemetry standards, efficient usage patterns, and scalable platform abstractions.
- Ability to make forward-looking technical decisions and lead others through ambiguity.