Principal Data Engineer

Sanas

full-time

Posted on: 9/9/2025

Location: California • 🇺🇸 United States

Lead

Airflow, AWS, Cloud, Google Cloud Platform, Kafka, Postgres, Ray, Spark

About the role

Architect and lead development of large-scale data pipelines and data lakes to ingest, transform, and serve high-quality data for AI model training, product telemetry, and analytics.
Drive long-term data infrastructure strategy across streaming and batch, feature store extensions, Iceberg/Delta lake choices, metadata management, and lakehouse evolution.
Drive platform and infrastructure decisions, optimizing compute fleets (e.g., Ray, Spark clusters), orchestration tooling (Airflow, Dagster), and streaming stacks (Kafka, Flink).
Collaborate with AI research scientists, engineering leads, product, finance, marketing, and legal to align data architecture with business and regulatory requirements.
Advocate best practices in data governance, lineage, observability, testing, tooling, and disaster recovery across pipelines and data stores.
Act as a mentor and technical leader — review design and code, share patterns, elevate team capability, and support recruitment and hiring.
Drive build-vs-buy decisions for data quality and observability tooling.
Shape technical vision, own strategic architecture decisions, and mentor a growing team of Data Engineers focused on delivering reliable, scalable data systems for machine learning.

10+ years of experience in Data Engineering, Infrastructure, or ML Systems, with at least 2 years in a technical leadership capacity.
Expertise in building distributed batch and real-time data systems.
Expertise in databases (such as Postgres) and data lake and warehouse platforms (such as Snowflake, Databricks, and ClickHouse).
Experience using data processing frameworks like Spark, Flink, and Ray
Deep experience with cloud platforms (AWS/GCP), object storage (e.g., S3), and orchestrators like Airflow and Dagster
Strong knowledge of data lifecycle management, including privacy, security, compliance and reproducibility
Comfortable working in a fast-paced startup environment
Strategic mindset and proven ability to collaborate across engineering, ML and product teams to deliver infrastructure that scales with the business.
Familiarity with audio data and its unique challenges (large file sizes, time-series features, metadata handling) — strong plus
Experience with Voice AI models like ASR, TTS, and speaker verification
Familiarity with real-time streaming and analytics systems like Kafka, Flink, Druid, and Pinot
Familiarity with ML workflows including MLOps, feature engineering, model training and inference
Experience with labeling tools, audio annotation platforms, or human-in-the-loop annotation pipelines