Architect, develop, and maintain scalable, efficient, and fault-tolerant data pipelines using Python and PySpark.
Design pipeline workflows for batch and real-time data processing using orchestration tools like Apache Airflow or Azure Data Factory (an Airflow sketch follows this list).
Implement automated data ingestion frameworks to extract data from structured, semi-structured, and unstructured sources such as APIs, FTP, and data streams.
Architect and optimize scalable Data Warehouse and Data Lake solutions using Snowflake, Azure Data Lake, or AWS S3.
Implement partitioning, bucketing, and indexing strategies for efficient querying and data storage management (a partitioning and bucketing sketch follows this list).
Develop ETL/ELT pipelines using tools like Azure Data Factory or Snowflake to handle complex data transformations and business logic.
Integrate DBT to automate data transformations, ensuring modularity and testability.
Ensure pipelines are optimized for cost-efficiency and high performance.
Write, optimize, and troubleshoot complex SQL queries for data manipulation, aggregation, and reporting.
Design and implement dimensional and normalized data models (star and snowflake schemas) for analytics use cases.
Deploy and manage data workflows on cloud platforms using services like AWS Glue, Azure Synapse Analytics, or Databricks.
Monitor resource usage and costs, implementing cost-saving measures such as data lifecycle management and auto-scaling.
Implement data quality frameworks to validate, clean, and enrich datasets.
Build self-healing mechanisms to minimize downtime and ensure reliability of critical pipelines.
Optimize Spark workflows by tuning executor memory and partitioning (a tuning sketch follows this list).
Conduct profiling and debugging of data workflows to identify and resolve bottlenecks.
Collaborate with data analysts, scientists, and stakeholders to define requirements and deliver usable datasets.
Maintain clear documentation for pipelines, workflows, and architectural decisions.
Conduct code reviews to ensure best practices in coding and performance optimization.
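Below is a minimal sketch of the kind of Airflow orchestration referenced above, assuming a hypothetical daily batch job; the DAG id, schedule, and task callables are illustrative placeholders, not part of any existing pipeline.

```python
# Minimal Apache Airflow DAG sketch for a daily batch pipeline (Airflow 2.x).
# The DAG id, schedule, and callables are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source(**context):
    # Placeholder: pull raw records from an upstream API or FTP drop.
    print("extracting raw data")


def transform_and_load(**context):
    # Placeholder: apply business logic and load into the warehouse or lake.
    print("transforming and loading")


with DAG(
    dag_id="daily_batch_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    extract >> load  # run ingestion before transformation
```

The same dependency shape maps directly onto Azure Data Factory activities if that is the orchestrator in use.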
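A hedged PySpark sketch of the partitioning and bucketing strategies mentioned above; the storage paths, table name, and column names are hypothetical.

```python
# PySpark sketch: partitioned and bucketed storage layouts for efficient querying.
# Paths, table names, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-layout-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Partition on a low-cardinality column so queries can prune whole directories.
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://example-bucket/curated/events/"))

# Bucket on a frequent join key to reduce shuffles; bucketing requires saveAsTable.
(events.write
    .mode("overwrite")
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("curated.events_bucketed"))
```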
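A short sketch of the Spark tuning levers named above, executor memory and partitioning; the values are illustrative starting points, not recommendations for any particular workload.

```python
# Sketch of Spark tuning knobs: executor memory, cores, and shuffle partitioning.
# All values and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "400")  # partitions created by shuffles
    .config("spark.sql.adaptive.enabled", "true")   # let AQE coalesce small partitions
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/curated/events/")

# Repartition on the join key to even out skew before a heavy aggregation.
df = df.repartition(400, "customer_id")
```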
Requirements
Advanced skills in Python and PySpark for high-performance distributed data processing.
Proficient in creating data pipelines with orchestration frameworks like Apache Airflow or Azure Data Factory.
Strong experience with Snowflake and with SQL Data Warehouse and Data Lake architectures.
Ability to write, optimize, and troubleshoot complex SQL queries and stored procedures.
Deep understanding of building and managing ETL/ELT workflows using tools such as DBT, Snowflake, or Azure Data Factory.
Hands-on experience with cloud platforms such as Azure or AWS, including services like S3, Lambda, Glue, or Azure Blob Storage.
Proficient in designing and implementing data models, including star and snowflake schemas (a star-schema sketch follows this list).
Familiarity with distributed processing systems and concepts such as Spark, Hadoop, or Databricks.
Experience with real-time data processing frameworks such as Kafka or Kinesis (a streaming ingestion sketch follows this list).
Certifications in Snowflake (good to have).
Cloud certifications in Azure, AWS, or GCP (good to have).
Knowledge of data visualization platforms such as Power BI, Tableau, or Looker.
Strong teamwork, communication skills, and intellectual curiosity.
Ability to identify, troubleshoot, and resolve complex data issues effectively.
Willingness to embrace new tools, technologies, and methodologies.
Innovative thinker with a proactive approach to overcoming challenges.
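A small PySpark sketch of the star-schema shape referenced in the data-modeling requirement: one fact table joined to its dimensions on surrogate keys. Table and column names are hypothetical.

```python
# Star-schema sketch: a fact table joined to conformed dimensions on surrogate keys.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

fact_sales = spark.table("analytics.fact_sales")      # grain: one row per order line
dim_customer = spark.table("analytics.dim_customer")  # customer attributes
dim_date = spark.table("analytics.dim_date")          # calendar attributes

monthly_sales = (
    fact_sales
    .join(dim_customer, "customer_key")  # surrogate-key joins keep the fact table narrow
    .join(dim_date, "date_key")
    .groupBy("region", "calendar_month")
    .sum("net_amount")
)
```

A snowflake schema simply normalizes the dimensions further (for example, splitting region out of the customer dimension), trading extra joins for reduced redundancy.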
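A hedged sketch of real-time ingestion using Spark Structured Streaming with a Kafka source; the broker address, topic, and paths are hypothetical, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Structured Streaming sketch: read from Kafka, parse the payload, append to Parquet.
# Broker, topic, and paths are hypothetical; requires the spark-sql-kafka connector.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-ingest-sketch").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

# Kafka delivers key/value as binary; cast the payload before transforming it.
parsed = stream.select(col("value").cast("string").alias("payload"))

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/streaming/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
```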