Tech Stack
Airflow, Amazon Redshift, Apache, AWS, Cloud, ETL, GraphQL, Pandas, Python, Ray, Spark
About the role
- Design, build, and maintain ETL/ELT pipelines to extract, transform, and load data from various sources into cloud-based data platforms
- Develop and manage data architectures, data lakes, and data warehouses on AWS (S3, Redshift, Glue, Athena)
- Collaborate with data scientists, analysts, and business stakeholders to ensure data accessibility, quality, and security
- Optimize performance of large-scale data systems and implement monitoring, logging, and alerting for pipelines
- Work with both structured and unstructured data, ensuring reliability and scalability
- Implement data governance, security, and compliance standards
- Continuously improve data workflows by leveraging automation, CI/CD, and Infrastructure-as-Code (IaC)
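The extract/transform/load work described above can be sketched minimally in Python with pandas. This is an illustrative toy, not part of the posting: the column names and rules are hypothetical, and an inline CSV string stands in for an S3 object (in production the extract and load steps would read from and write to S3/Redshift).

```python
import io

import pandas as pd


def extract(raw_csv: str) -> pd.DataFrame:
    """Extract: parse raw source data (an inline CSV standing in for an S3 object)."""
    return pd.read_csv(io.StringIO(raw_csv))


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: normalize column names, drop incomplete rows, enforce types."""
    df = df.rename(columns=lambda c: c.strip().lower())
    df = df.dropna(subset=["user_id"])
    df["user_id"] = df["user_id"].astype(int)
    df["amount"] = df["amount"].astype(float)
    return df


def load(df: pd.DataFrame) -> str:
    """Load: serialize back to CSV; in production this would target S3 or Redshift."""
    return df.to_csv(index=False)


# Hypothetical sample input: one row is missing its user_id and gets dropped.
raw = "User_ID,Amount\n1,10\n,5\n2,7.5\n"
out = load(transform(extract(raw)))
```

Keeping each stage a pure function of a DataFrame is what makes pipelines like this easy to unit-test and to port between runners (Glue, EMR/Spark, or a plain Lambda).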
Requirements
- Hands-on expertise in AWS native data services: S3, Glue (Schema Registry, Data Catalog), Step Functions, Lambda, Lake Formation, Athena, MSK/Kinesis, EMR (Spark), SageMaker (including Feature Store)
- Experience designing and optimizing batch (Step Functions) and streaming (Kinesis/MSK) ingestion pipelines
- Deep understanding of data mesh principles, domain-oriented ownership, data-as-a-product, and federated governance
- Experience enabling self-service platforms, decentralized ingestion, and transformation workflows
- Advanced knowledge of schema enforcement, evolution, and validation (preferably AWS Glue Schema Registry/JSON/Avro)
- Proficiency with ELT/ETL stack: Spark (EMR), dbt, AWS Glue, and Python (pandas)
- Experience designing and supporting vector stores (OpenSearch), feature stores (SageMaker Feature Store), and integrating with MLOps/data pipelines for AI/semantic search and RAG workloads
- Familiarity with metadata, catalog, and lineage solutions (Glue Data Catalog, Collibra, Atlan, Amundsen, etc.)
- Knowledge of data security and compliance: row/column-level security (Lake Formation), KMS encryption, role-based access, AuthN/AuthZ standards (JWT/OIDC), GDPR/SOC2/ISO 27001-aligned policies
- Experience with pipeline orchestration (AWS Step Functions, Apache Airflow/MWAA) and monitoring (CloudWatch, X-Ray)
- API design experience for batch and real-time data delivery (REST, GraphQL)
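Of the requirements above, schema enforcement is the most concrete to illustrate. The sketch below is a deliberately simplified stand-in for what Glue Schema Registry does with registered Avro/JSON schemas: the field names and type rules are invented for the example.

```python
# Toy record schema standing in for a schema registered in Glue Schema Registry.
SCHEMA = {
    "user_id": int,
    "event": str,
    "amount": float,
}


def validate(record: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of schema violations; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


good = {"user_id": 1, "event": "click", "amount": 2.5}
bad = {"user_id": "1", "event": "click"}
```

A real registry adds what this sketch omits: versioning, compatibility checks between schema versions (the "evolution" in the requirement), and producer/consumer integration.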