Sanas

Principal Data Engineer

full-time

Origin: 🇺🇸 United States, California

Job Level

Lead

Tech Stack

Airflow, AWS, Cloud, Google Cloud Platform, Kafka, Postgres, Ray, Spark

About the role

  • Architect and lead development of large-scale data pipelines and data lakes to ingest, transform, and serve high-quality data for AI model training, product telemetry, and analytics.
  • Drive long-term data infrastructure strategy across streaming and batch, feature store extensions, table format choices (Iceberg vs. Delta Lake), metadata management, and lakehouse evolution.
  • Drive platform and infrastructure decisions, optimizing compute fleets (e.g., Ray, Spark clusters), orchestration tooling (Airflow, Dagster), and streaming stacks (Kafka, Flink).
  • Collaborate with AI research scientists, engineering leads, product, finance, marketing, and legal to align data architecture with business and regulatory requirements.
  • Advocate best practices in data governance, lineage, observability, testing, tooling, and disaster recovery across pipelines and data stores.
  • Act as a mentor and technical leader — review design and code, share patterns, elevate team capability, and support recruitment and hiring.
  • Drive build-vs-buy decisions for data quality and observability tooling.
  • Shape technical vision, own strategic architecture decisions, and mentor a growing team of Data Engineers focused on delivering reliable and scalable data systems for Machine Learning at scale.

Requirements

  • 10+ years of experience in Data Engineering, Infrastructure, or ML Systems, including at least 2 years in a technical leadership capacity.
  • Expertise in building distributed batch and real-time data systems
  • Expertise in Databases (like Postgres) and Data Lakes (like Snowflake, Databricks, and ClickHouse)
  • Experience using data processing frameworks like Spark, Flink, and Ray
  • Deep experience with cloud platforms (AWS/GCP), object storage (e.g., S3), and orchestrators like Airflow and Dagster
  • Strong knowledge of data lifecycle management, including privacy, security, compliance and reproducibility
  • Comfortable working in a fast-paced startup environment
  • Strategic mindset and proven ability to collaborate across engineering, ML and product teams to deliver infrastructure that scales with the business.
  • Familiarity with audio data and its unique challenges (large file sizes, time-series features, metadata handling) — strong plus
  • Experience with Voice AI models like ASR, TTS, and speaker verification
  • Familiarity with real-time data processing frameworks like Kafka, Flink, Druid and Pinot
  • Familiarity with ML workflows including MLOps, feature engineering, model training and inference
  • Experience with labeling tools, audio annotation platforms, or human-in-the-loop annotation pipelines