Tech Stack
Airflow, Apache, AWS, Cloud, Google Cloud Platform, Hadoop, NoSQL, PySpark, Python, Spark
About the role
- Design and develop scalable and efficient data pipelines within the Big Data ecosystem using Apache Spark and Apache Airflow; document new and existing pipelines and datasets to ensure clarity and maintainability.
- Demonstrate and implement data architecture and management practices across data pipelines, data lakes, and modern data warehousing, including virtual data warehouses and push-down analytics.
- Write clean, efficient, and maintainable code in Python to support data processing and platform functionality.
- Utilize cloud-based infrastructures (AWS/GCP) and their services, including compute resources, databases, and data warehouses; manage and optimize cloud-based data infrastructure.
- Develop and manage workflows using Apache Airflow for scheduling and orchestrating data processing jobs; create and maintain Airflow DAGs (a minimal sketch follows this list).
- Implement and maintain Big Data architecture including cluster installation, configuration, monitoring, security, resource management, maintenance, and performance tuning.
- Create detailed designs and proof-of-concepts (POCs) to enable new workloads and technical capabilities on the platform; collaborate with platform and infrastructure engineers to implement capabilities in production.
- Manage workloads and optimize resource allocation and scheduling across multiple tenants to fulfill SLAs.
- Participate in planning activities and collaborate with data science teams to enhance platform skills and capabilities.
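To give a rough sense of the Spark-plus-Airflow orchestration work described above, here is a minimal sketch of an Airflow DAG that submits a nightly PySpark job. It assumes Airflow 2.4+ with the apache-airflow-providers-apache-spark package installed; the DAG id, schedule, application path, and connection id are illustrative assumptions, not details taken from this role.

```python
# Minimal sketch: an Airflow DAG that schedules a nightly PySpark job.
# DAG id, schedule, paths, and connection id below are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-platform",          # assumed owning team
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="nightly_events_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Submit a PySpark application to the cluster via the configured Spark connection.
    transform_events = SparkSubmitOperator(
        task_id="transform_events",
        application="/opt/jobs/transform_events.py",  # assumed job location
        conn_id="spark_default",
        conf={"spark.executor.memory": "4g"},
    )
```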
Requirements
- 10+ years of hands-on experience in Big Data technologies, including at least 3 years of experience working with Spark and PySpark.
- At least 6 years of experience in cloud environments is required; experience with Google Cloud Platform (GCP), particularly Dataproc, is preferred.
- Must have hands-on experience managing cloud-deployed solutions, preferably on AWS, along with NoSQL and graph databases.
- Prior experience working in a global organization and within a DevOps model is considered a strong plus.
- Exhibit expert-level programming skills in Python, with the ability to write clean, efficient, and maintainable code.
- Demonstrate familiarity with data pipelines, data lakes, and modern data warehousing practices, including virtual data warehouses and push-down analytics.
- Design and implement distributed data processing solutions using technologies such as Apache Spark and Hadoop (see the PySpark sketch after this list).
- Develop and manage workflows using Apache Airflow for scheduling and orchestrating data processing jobs; create and maintain Apache Airflow DAGs.
- Possess strong knowledge of Big Data architecture, including cluster installation, configuration, monitoring, security, resource management, maintenance, and performance tuning.
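As an illustration of the distributed data processing expected here, the following is a minimal PySpark sketch: read raw events from a data lake, aggregate them per user per day, and write the result back as partitioned Parquet. The bucket paths and column names are assumptions for the example, and reading from s3a:// additionally assumes the Hadoop AWS connector is on the classpath.

```python
# Minimal PySpark sketch: read raw events from object storage, aggregate,
# and write partitioned Parquet back to the lake. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_event_rollup").getOrCreate()

# Read raw JSON events from an assumed data-lake location.
events = spark.read.json("s3a://example-data-lake/raw/events/")

# Aggregate events per user per day; Spark distributes this work across executors.
daily_rollup = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

# Partition by date so downstream queries can prune partitions (push-down-friendly layout).
daily_rollup.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-data-lake/curated/daily_event_rollup/"
)

spark.stop()
```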