Senior Site Reliability Engineer, Observability

Chainlink Labs

full-time

Posted on: 9/5/2025

Location: 🇺🇸 United States

Senior

AWS, Distributed Systems, Go, Grafana, Java, Kubernetes, Oracle, Packer, Perl, Prometheus, Python, Ruby, Splunk, Swift, Terraform, Web3

About the role

Build and orchestrate a modern OpenTelemetry (OTEL)-based observability platform
Support multiple telemetry types, including metrics, logs and traces
Define and support modern observability governance, and solve observability problems at scale
Ensure reliability, security, and performance exceed defined SLAs
Work with engineers across the company to troubleshoot issues, deploy new products and services, increase velocity and decrease cognitive load
Lead the design and deployment of monitoring/observability services that detect issues and alert the team when action is needed
Ingest, aggregate, transform, and utilize data from multiple sources in the real-time data pipeline
Oversee the availability, performance, and supportability of observability infrastructure
Create processes around alert response operations to ensure reliable delivery of oracle data
Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release
Champion reliability and security by doing work correctly the first time

7+ years of relevant professional experience (DevOps, infrastructure, SRE, and/or platform teams)
Ability to develop software outside of the scope of typical infrastructure requirements and configurations
Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
Expert knowledge in all aspects of designing, developing, and managing large real-time systems
Experience with monitoring and logging: exporting metrics using Prometheus, building Grafana dashboards, centralized logging solutions (ELK Stack, Splunk, Grafana Stack)
Experience with distributed systems and container orchestration; maintained or built Kubernetes clusters and deployed new services on them
Strong communication skills; able to give and receive constructive feedback and participate in planning meetings and code reviews
Excitement for blockchain, Web3, and decentralized technologies (desired)
Experience running infrastructure in the blockchain/Web3 space (desired)
Ability to scale systems sustainably through automation and advocate for reliability and velocity improvements (desired)
Experience working remotely in a distributed team (desired)
Strong desire to grow, improve, and automate services to reduce toil (desired)
Familiarity/proficiency with tools: AWS; Terraform/Terragrunt; Kubernetes, Calico, ArgoCD; Prometheus and Grafana; GitHub Actions; Packer