Tech Stack
BigQuery, Grafana, Kafka, Kubernetes, Postgres, Prometheus, Python, Terraform
About the role
- Design, provision, and operate our core platform: multi-env/multi-region Kubernetes clusters; networking, security, and identities; storage (object store, tables/Iceberg/“DuckLake”); and runtimes for Python jobs, notebooks, and services.
- Build code abstractions: Python CLIs/SDKs, templates, and controllers that turn infrastructure into simple workflows.
- Define IaC standards (Terraform, Helm) for everything: clusters, apps, policies, and data runtimes.
- Design and implement the platform control plane: end-user secrets management, cost controls, tenant/resource limits, SSO, and scheduling.
- Ship observability as a product: metrics/logs/traces, golden dashboards, SLIs/SLOs, runbooks, and incident/postmortem practice.
- Work closely with the CTO and the team's software engineers on all of the above.
Requirements
- 5+ years in Platform/SRE/DevInfra roles (or equivalent impact) building and running production systems.
- Strong Python skills applied to automation and tooling (CLIs, bots, controllers/operators, SDKs).
- Deep Kubernetes experience (cluster ops, Helm/Kustomize, controllers/operators, container networking).
- Practical observability (Prometheus/Grafana/OpenTelemetry or similar), performance tuning, and incident response.
- Running Python data pipelines/apps in production (dlt, dbt, Polars, DuckDB, Iceberg) (nice-to-have).
- Storage & query engines: Parquet, Iceberg, DuckDB/MotherDuck, BigQuery/Snowflake/Postgres (nice-to-have).
- Eventing/streaming (Kafka/Pub/Sub), batch schedulers, or serverless Python runtimes (nice-to-have).
- Security & supply-chain hardening (images, SBOMs, policy-as-code, secret rotation) (nice-to-have).
- OSS contributions or demo-driven platform work you can show us (nice-to-have).