Tech Stack
Cloud, Docker, Go, Grafana, gRPC, Kubernetes, Prometheus, Python, Terraform, Web3
About the role
- Deploy, operate, and scale distributed clusters of OptimumP2P nodes, gateways, and relays across multi-cloud environments
- Build and maintain infrastructure-as-code (Terraform, Helm, Kubernetes) for rapid, reproducible deployments
- Define and monitor SLOs around propagation latency, attestation inclusion delay, and network reliability; implement automated failover and redundancy
- Develop monitoring pipelines (Prometheus, Grafana, OpenTelemetry) with actionable alerts and dashboards for network health and validator-facing KPIs
- Harden nodes, APIs, and relays against DDoS, replay attacks, and resource exhaustion; manage secrets and peer identities securely
- Maintain CI/CD build pipelines for services (multi-arch Docker builds, secure image publishing)
- Work closely with protocol engineers and researchers on large-scale experiments
- Create runbooks, perform post-mortems, and continuously improve operational reliability
- Write automation scripts, custom operators, and CLI tools to improve infrastructure efficiency and debugging workflows
Requirements
- 5+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Strong expertise in Kubernetes, Docker, and Terraform (multi-region, multi-cloud deployments)
- Proven track record in observability and monitoring (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
- Deep understanding of networking fundamentals (TCP/UDP, packet loss, NAT traversal, DDoS mitigation)
- Hands-on experience with CI/CD pipelines (GitHub Actions, GitLab CI)
- Security-first mindset: IAM, secrets management, infrastructure hardening
- Experience writing automation scripts and tooling in Go, Python, or Bash
- Bonus: familiarity with libp2p, gRPC, and blockchain validator infrastructure