Chainlink Labs

Senior Site Reliability Engineer, Observability

full-time

Origin: 🇺🇸 United States

Job Level

Senior

Tech Stack

AWS, Distributed Systems, Go, Grafana, Java, Kubernetes, Oracle, Packer, Perl, Prometheus, Python, Ruby, Splunk, Swift, Terraform, Web3

About the role

  • Build and orchestrate a modern OpenTelemetry (OTel)-based observability platform
  • Support multiple telemetry types, including metrics, logs and traces
  • Define and support modern observability governance and solve observability problems at scale
  • Ensure reliability, security, and performance exceed defined SLAs
  • Work with engineers across the company to troubleshoot issues, deploy new products and services, increase velocity and decrease cognitive load
  • Lead the design and deployment of monitoring and observability services that detect issues and alert the team when action is needed
  • Ingest, aggregate, transform, and use data from multiple sources in the real-time data pipeline
  • Oversee the availability, performance, and supportability of observability infrastructure
  • Create processes around alert response operations to ensure reliable delivery of oracle data
  • Recommend the metrics that each new feature release should collect so that alerts can be created for it
  • Champion reliability and security by doing work correctly the first time

Requirements

  • 7+ years of relevant professional experience on DevOps, infrastructure, SRE, and/or platform teams
  • Ability to develop software outside of the scope of typical infrastructure requirements and configurations
  • Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
  • Expert knowledge of designing, developing, and managing large real-time systems
  • Experience with monitoring and logging: exporting metrics using Prometheus, building Grafana dashboards, centralized logging solutions (ELK Stack, Splunk, Grafana Stack)
  • Experience with distributed systems and container orchestration; maintained or built Kubernetes clusters and deployed new services on them
  • Strong communication skills; able to give and receive constructive feedback and participate in planning meetings and code reviews
  • Excitement for blockchain, Web3, and decentralized technologies (desired)
  • Experience running infrastructure in the blockchain/Web3 space (desired)
  • Ability to scale systems sustainably through automation and advocate for reliability and velocity improvements (desired)
  • Experience working remotely in a distributed team (desired)
  • Strong desire to grow, improve, and automate services to reduce toil (desired)
  • Familiarity or proficiency with the following tools: AWS; Terraform/Terragrunt; Kubernetes, Calico, ArgoCD; Prometheus and Grafana; GitHub Actions; Packer