Tech Stack
Airflow, AWS, Cloud, Google Cloud Platform, Kafka, Postgres, Ray, Spark
About the role
- Architect and lead development of large-scale data pipelines and data lakes to ingest, transform, and serve high-quality data for AI model training, product telemetry, and analytics.
- Drive long-term data infrastructure strategy across streaming and batch, feature store extensions, table format choices (Iceberg vs. Delta Lake), metadata management, and lakehouse evolution.
- Drive platform and infrastructure decisions, optimizing compute fleets (e.g., Ray, Spark clusters), orchestration tooling (Airflow, Dagster), and streaming stacks (Kafka, Flink).
- Collaborate with AI research scientists, engineering leads, product, finance, marketing, and legal to align data architecture with business and regulatory requirements.
- Advocate best practices in data governance, lineage, observability, testing, tooling, and disaster recovery across pipelines and data stores.
- Act as a mentor and technical leader — review design and code, share patterns, elevate team capability, and support recruitment and hiring.
- Drive build-vs-buy decisions for data quality and observability tooling.
- Shape technical vision, own strategic architecture decisions, and mentor a growing team of Data Engineers focused on delivering reliable and scalable data systems for Machine Learning at scale.
Requirements
- 10+ years of experience in Data Engineering, Infrastructure, or ML Systems, with at least 2 years in a technical leadership capacity.
- Expertise in building distributed batch and real-time data systems
- Expertise in databases (like Postgres) and data warehouse or lakehouse platforms (like Snowflake, Databricks, and ClickHouse)
- Experience using data processing frameworks like Spark, Flink, and Ray
- Deep experience with cloud platforms (AWS/GCP), object storage (e.g., S3), and orchestrators like Airflow and Dagster
- Strong knowledge of data lifecycle management, including privacy, security, compliance and reproducibility
- Comfortable working in a fast-paced startup environment
- Strategic mindset and proven ability to collaborate across engineering, ML and product teams to deliver infrastructure that scales with the business.
- Familiarity with audio data and its unique challenges (large file sizes, time-series features, metadata handling) is a strong plus
- Experience with Voice AI models like ASR, TTS, and speaker verification
- Familiarity with real-time streaming and analytics systems like Kafka, Flink, Druid, and Pinot
- Familiarity with ML workflows including MLOps, feature engineering, model training and inference
- Experience with labeling tools, audio annotation platforms, or human-in-the-loop annotation pipelines