
Data Infrastructure Engineer
MeshyAI
Full-time
Location Type: Remote
Location: Remote • California • 🇺🇸 United States
Job Level
Mid-Level / Senior
Tech Stack
Airflow, AWS, Azure, Cloud, Distributed Systems, ETL, Google Cloud Platform, Java, Python, Ray, Scala, Spark, SQL
About the role
- Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries).
- Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics.
- Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs.
- Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents).
- Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction (a rough sketch of this step follows this list).
- Maintain data lineage, reproducibility, and governance for datasets used in AI/ML pipelines.
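
For a sense of what the preprocessing work above can look like in practice, here is a minimal sketch of a distributed image-normalization step using Ray, one of the frameworks named in this posting. The assets/ directory, the 512x512 target size, and the output naming are illustrative assumptions, not details from the role.

```python
# Minimal sketch of a distributed preprocessing step with Ray.
# The assets/ directory, 512x512 target size, and .norm.png output
# suffix are illustrative assumptions, not an actual MeshyAI pipeline.
import json
from pathlib import Path

import ray
from PIL import Image

ray.init()  # start (or connect to) a Ray cluster


@ray.remote
def preprocess_image(path: str, size: tuple = (512, 512)) -> dict:
    """Convert, normalize, and extract metadata for one image asset."""
    img = Image.open(path)
    meta = {"source": path, "orig_size": img.size, "mode": img.mode}
    img = img.convert("RGB").resize(size)      # normalize mode and resolution
    out = Path(path).with_suffix(".norm.png")
    img.save(out, format="PNG")                # format conversion
    meta["output"] = str(out)
    return meta


# Fan the work out across the cluster and collect metadata for a catalog.
paths = [str(p) for p in Path("assets").glob("*.jpg")]
metadata = ray.get([preprocess_image.remote(p) for p in paths])
Path("metadata.jsonl").write_text("\n".join(json.dumps(m) for m in metadata))
```

In a real pipeline, a step like this would more likely stream assets from object storage (S3/GCS/Azure Blob) and record lineage in a metadata catalog, per the responsibilities above.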
Requirements
- 5+ years of experience in data engineering, distributed systems, or similar.
- Strong programming skills in Python (Scala/Java/C++ a plus).
- Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration.
- Proficiency with distributed frameworks (Spark, Dask, Ray, Flink).
- Familiarity with cloud platforms (AWS/GCP/Azure) and with storage systems and table formats (S3, Parquet, Delta Lake, etc.).
- Experience with workflow orchestration tools (Airflow, Prefect, Dagster); a minimal DAG sketch follows this list.
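
As a rough illustration of the orchestration requirement, the sketch below wires a toy extract-transform-load sequence into an Airflow DAG using the TaskFlow API. The DAG id, schedule, and task bodies are placeholders, not part of the posting.

```python
# Minimal sketch of an ETL DAG with Airflow's TaskFlow API.
# DAG id, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def asset_etl():
    @task
    def extract() -> list[str]:
        # e.g., list newly landed objects in a cloud bucket
        return ["assets/scene_001.glb"]

    @task
    def transform(paths: list[str]) -> list[dict]:
        # e.g., validate and enrich each asset record
        return [{"path": p, "valid": True} for p in paths]

    @task
    def load(records: list[dict]) -> None:
        # e.g., upsert records into a metadata catalog
        print(f"loaded {len(records)} records")

    load(transform(extract()))


asset_etl()
```

Prefect and Dagster express the same dependency graph with their own decorators (roughly @flow/@task and @op/@job, respectively).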
Benefits
- Competitive salary, benefits and stock options.
- 401(k) plan for employees.
- Comprehensive health, dental, and vision insurance.
- The latest and best office equipment.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Scala, Java, C++, SQL, ETL, ELT, Spark, Dask, Ray