Tech Stack
AWS, Distributed Systems, Go, Grafana, Java, Kubernetes, Oracle, Packer, Perl, Prometheus, Python, Ruby, Splunk, Swift, Terraform, Web3
About the role
- Build and orchestrate a modern OTEL-based observability platform
- Support multiple telemetry types, including metrics, logs and traces
- Define and support modern observability governance and address problems at scale
- Ensure reliability, security, and performance exceed defined SLAs
- Work with engineers across the company to troubleshoot issues, deploy new products and services, increase velocity, and decrease cognitive load
- Lead the design and deployment of monitoring/observability services that detect issues and alert the team when action is needed
- Ingest, aggregate, transform, and utilize data from multiple sources in the real-time data pipeline
- Oversee the availability, performance, and supportability of observability infrastructure
- Create processes around alert response operations to ensure reliable delivery of oracle data
- Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release
- Champion reliability and security by doing work correctly the first time
Requirements
- 7+ years of relevant professional experience (DevOps, infrastructure, SRE, and/or platform teams)
- Ability to develop software outside of the scope of typical infrastructure requirements and configurations
- Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
- Expert knowledge in all aspects of designing, developing, and managing large real-time systems
- Experience with monitoring and logging: exporting metrics using Prometheus, building Grafana dashboards, centralized logging solutions (ELK Stack, Splunk, Grafana Stack)
- Experience with distributed systems and container orchestration; maintained or built Kubernetes clusters and deployed new services on them
- Strong communication skills; able to give and receive constructive feedback and participate in planning meetings and code reviews
- Excitement for blockchain, Web 3.0, and decentralized technologies (desired)
- Experience running infrastructure in the blockchain/web3 space (desired)
- Ability to scale systems sustainably through automation and advocate for reliability and velocity improvements (desired)
- Experience working remotely in a distributed team (desired)
- Strong desire to grow, improve, and automate services to reduce toil (desired)
- Familiarity/proficiency with tools: AWS; Terraform/Terragrunt; Kubernetes, Calico, ArgoCD; Prometheus and Grafana; GitHub Actions; Packer