Salary
💰 $190,800 - $267,100 per year
Tech Stack
Airflow, Kafka, Spark, SQL
About the role
- Lead development of data pipelines and workflows for large-scale ML models at Reddit.
- Design and implement scalable and secure data processing pipelines and storage environments that prepare our source of truth datasets for our models.
- Ensure data is cleansed, mapped, transformed, and otherwise optimized for storage and use according to business and technical requirements.
- Build effective data pipelines and workflows to streamline data ingestion, processing, and distribution tasks (a minimal illustrative sketch follows this list).
- Set up and operate data workflow management tools for SQL code versioning, dependency tracing, etc.
- Load transformed data into storage and reporting structures in destinations including data warehouses, reporting systems, and analytics applications.
- Monitor and troubleshoot issues with the data environment to maintain high availability and performance.
- Support monitoring and observability across training datasets and model metrics, and implement diagnostic tools for metric movements.
- Maintain effective documentation of data procedures, systems, and architectures to ensure clarity and enable easy collaboration.
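
For illustration only, a minimal sketch of the ingest → transform → load orchestration pattern described above, written as an Airflow 2.x DAG. The DAG id, task names, and placeholder callables are hypothetical examples, not Reddit's actual pipelines; only Airflow itself is taken from the posting's tech stack.

```python
# Minimal sketch of an ingest -> transform -> load pipeline (Airflow 2.x).
# All names and logic below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**context):
    # Pull raw events from an upstream source (placeholder).
    print("ingesting raw events")


def transform(**context):
    # Cleanse, map, and transform records into the target schema (placeholder).
    print("transforming records")


def load(**context):
    # Load transformed data into warehouse / reporting structures (placeholder).
    print("loading into warehouse")


with DAG(
    dag_id="example_ingest_transform_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the three stages in sequence.
    ingest_task >> transform_task >> load_task
```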
Requirements
- 5+ years of experience in Data Engineering or ML Infrastructure
- Experience with large-scale data transforms to prepare graph data
- Experience with graph databases, Spark, and Kafka pipelines
- Experience working with Airflow and MLflow
- Experience with storage frameworks like BigQuery, Parquet, and Iceberg
- Awareness of ML models and architectures is a huge plus.
- Strong focus on scalability, reliability, performance, and ease of use.
- Strong organizational & communication skills