Tech Stack
Airflow, AWS, Cloud, Docker, Go, Grafana, Kubernetes, Prometheus, Python, Terraform, TypeScript
About the role
- Build developer-facing tooling, SDKs, and wrappers that abstract away Kubernetes, workflow, and infra details
- Design and harden CI/CD pipelines with faster feedback, per-MR preview environments, and reduced flakiness
- Deliver dashboards and shared observability for cost, reliability, and performance, consumable by engineers and leadership
- Partner with engineering teams to codify repeatable workflow patterns into libraries and templates
- Play a central role in rebuilding the workflow orchestration platform and design core primitives for AI workloads
- Build language-native wrappers and tooling around workflow features to make advanced orchestration accessible
- Implement secure-by-default systems: secrets management, auditability, least-privilege defaults, and bastions
- Define and enforce SRE practices — incident response, playbooks, SLIs/SLOs, backup/restore automation
- Shape the Platform team: standards, processes, and culture that emphasize self-service systems over tickets
Requirements
- Strong software engineering background: comfortable writing production-quality Python (and ideally TypeScript/Go) code
- Experience building developer productivity tooling (SDKs, CLI tools, CI/CD pipelines, internal dashboards, preview envs)
- Practical Kubernetes fluency — design and debug workloads, tune autoscaling, apply common patterns
- Hands-on experience with Kubernetes and workflow platforms (Argo Workflows, Airflow, Prefect, Temporal, etc.)
- Proficient with infrastructure-as-code and containerization — Docker-based services and modern IaC tools (Terraform preferred)
- Solid grounding in cloud infrastructure (AWS preferred) — compute, storage, networking, IAM, cost controls
- Familiarity with observability stacks (Prometheus, Grafana, OpenCost, Loki/Tempo/OTel)
- Comfort operating in high-scale, compute-intensive environments (ML training, batch/parallel workloads)
- Bias toward building paved roads and abstractions that make other engineers faster
- Prior experience in a Platform or DevProd team at a startup (nice to have)
- Background in MLOps foundations — training pipelines, model artifact management, GPU workloads (nice to have)
- Familiarity with on-prem or hybrid deployment patterns (nice to have)
- Knowledge of compliance frameworks (SOC 2, ISO 27001, ITAR, GDPR) and their infra implications (nice to have)
- Exposure to multi-tenant SaaS or Bring Your Own Cloud architectures (nice to have)
- Contributions to open-source infra/dev-tooling projects (nice to have)
- Primarily hiring within the US (occasional exceptions for outstanding candidates)