Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries).
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics.
Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs.
Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents).
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction.
Maintain data lineage, reproducibility, and governance for datasets used in AI/ML pipelines.
Requirements
5+ years of experience in data engineering, distributed systems, or a related field.
Strong programming skills in Python; Scala, Java, or C++ is a plus.
Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration.
Proficiency with distributed frameworks (Spark, Dask, Ray, Flink).
Familiarity with cloud platforms (AWS/GCP/Azure) and with storage and table formats (S3, Parquet, Delta Lake, etc.).
Experience with workflow orchestration tools (Airflow, Prefect, Dagster).
Benefits
Competitive salary, benefits, and stock options.
401(k) plan for employees.
Comprehensive health, dental, and vision insurance.
The latest and best office equipment.