Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data, including images, 3D/2D assets, and binaries.
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics.
Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs.
Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents).
Implement partitioning, sharding, caching strategies, and observability (monitoring, logging, alerting) for reliable pipelines.
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction (a minimal sketch follows this list).
Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs.
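For illustration only, not part of the role description: a minimal sketch of the kind of asset preprocessing mentioned above, assuming a hypothetical batch of JPEG images normalized to fixed-size RGB PNGs with Pillow. The paths and target resolution are placeholders.

```python
# Hypothetical sketch (assumed paths and resolution): format conversion,
# normalization, and basic metadata extraction for image assets using Pillow.
from pathlib import Path
from PIL import Image

TARGET_SIZE = (512, 512)  # assumed training resolution

def preprocess_image(src: Path, dst_dir: Path) -> dict:
    """Convert an image to an RGB PNG at a fixed size and return basic metadata."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as img:
        meta = {"source": str(src), "orig_size": img.size, "orig_mode": img.mode}
        normalized = img.convert("RGB").resize(TARGET_SIZE)
        out_path = dst_dir / (src.stem + ".png")
        normalized.save(out_path, format="PNG")
        meta["output"] = str(out_path)
    return meta

if __name__ == "__main__":
    records = [preprocess_image(p, Path("processed"))
               for p in Path("raw_images").glob("*.jpg")]
    print(f"processed {len(records)} assets")
```

In practice this step would typically run inside a distributed framework and emit its metadata records to a catalog rather than printing them.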
Requirements
5+ years of experience in data engineering, distributed systems, or a related field.
Strong programming skills in Python; Scala, Java, or C++ a plus.
Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration.
Proficiency with distributed frameworks (Spark, Dask, Ray, Flink).
Familiarity with cloud platforms (AWS/GCP/Azure) and storage systems (S3, Parquet, Delta Lake, etc.); a minimal sketch follows this list.
Experience with workflow orchestration tools (Airflow, Prefect, Dagster).
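For illustration only: a minimal PySpark sketch of the partitioned, columnar lakehouse writes referred to in the requirements above. The bucket paths, column names, and app name are hypothetical.

```python
# Hypothetical sketch (assumed paths and schema): read raw JSON metadata,
# derive a date partition column, and write date-partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("asset-metadata-etl").getOrCreate()

# Read raw metadata records and derive a partition column.
events = (
    spark.read.json("s3a://example-bucket/raw/asset_metadata/")  # assumed path
    .withColumn("ingest_date", F.to_date("ingested_at"))
)

# Partitioning by date lets downstream training and analytics jobs prune
# partitions instead of scanning the full dataset.
(
    events.repartition("ingest_date")
    .write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("s3a://example-bucket/curated/asset_metadata/")
)
```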
Benefits
Competitive salary, benefits, and stock options.
Comprehensive health, dental, and vision insurance.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.