Tech Stack
Apache, AWS, Distributed Systems, ETL, PySpark, Python, Spark, SQL
About the role
- Develop Spark applications in AWS Databricks, utilizing Python, PySpark, and SQL to meet project requirements and data processing needs.
- Design and implement robust ETL pipelines using Apache Spark in Databricks, ensuring data integrity, efficiency, and scalability.
- Collaborate with cross-functional teams to understand business requirements and design solutions that leverage structured, semi-structured, and unstructured data effectively.
- Write high-quality code in a timely manner, adhering to coding standards, best practices, and established development processes.
- Utilize version control systems like Git to manage codebase and ensure seamless collaboration within the team.
- Merge and consolidate various data sets using PySpark code, enabling streamlined data processing and analysis (see the consolidation sketch after this list).
- Work with APIs to facilitate data ingestion from diverse sources and integrate the data into the ecosystem (see the ingestion sketch after this list).
- Apply expertise in Databricks Delta Lake to optimize data storage, query performance, and overall data processing efficiency (see the maintenance sketch after this list).
- Demonstrate knowledge of application development life cycles and promote continuous integration/deployment practices for efficient project delivery.
- Perform query tuning, performance tuning, troubleshooting, and debugging for Spark and other big data solutions to enhance system efficiency and reliability (see the tuning sketch after this list).
- Exhibit expertise in database concepts and SQL to efficiently manipulate, process, and extract insights from complex datasets.
- Apply database engineering and design principles to ensure data infrastructure meets high standards of scalability, reliability, and performance.
- Leverage previous experience in handling large-scale distributed systems to deliver and operate data solutions efficiently.
- Demonstrate a successful track record of extracting value from extensive, disconnected datasets to drive data-driven decision-making.
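For illustration only, a minimal sketch of the kind of PySpark consolidation work described above; the paths, table names, and columns are hypothetical, not taken from any actual project.

```python
# Hypothetical consolidation job: join two source datasets, deduplicate,
# and stamp an ingestion time before writing out for downstream use.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("consolidate-example").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/orders/")        # hypothetical path
customers = spark.read.parquet("s3://example-bucket/customers/")  # hypothetical path

consolidated = (
    orders.join(customers, on="customer_id", how="left")
          .dropDuplicates(["order_id"])
          .withColumn("ingested_at", F.current_timestamp())
)

consolidated.write.mode("overwrite").format("delta").save("s3://example-bucket/consolidated/")
```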
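Likewise, a hedged sketch of API-based ingestion into Databricks; the endpoint, response shape, and target table are assumptions made for the example.

```python
# Hypothetical API ingestion: pull a JSON array of flat records and
# land it in a Delta table for downstream processing.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-ingest-example").getOrCreate()

resp = requests.get("https://api.example.com/v1/records", timeout=30)  # hypothetical endpoint
resp.raise_for_status()
records = resp.json()  # assumes the API returns a JSON array of flat objects

df = spark.createDataFrame(records)
df.write.mode("append").format("delta").saveAsTable("raw.api_records")  # hypothetical table
```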
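A short sketch of routine Delta Lake maintenance on Databricks; the table and column names are illustrative.

```python
# Hypothetical Delta Lake maintenance using Databricks SQL commands.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate rows commonly filtered by customer_id.
spark.sql("OPTIMIZE sales_db.orders ZORDER BY (customer_id)")

# Remove data files no longer referenced by the table (7-day retention shown).
spark.sql("VACUUM sales_db.orders RETAIN 168 HOURS")
```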
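Finally, a sketch of one common Spark tuning step of the sort this role calls for: hinting a broadcast join for a small dimension table and inspecting the physical plan. Paths and column names are hypothetical.

```python
# Hypothetical tuning pass: broadcast the small side of a join to avoid a
# shuffle, then check the physical plan before running at scale.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-example").getOrCreate()

large_fact = spark.read.parquet("s3://example-bucket/fact/")  # hypothetical path
small_dim = spark.read.parquet("s3://example-bucket/dim/")    # hypothetical path

spark.conf.set("spark.sql.shuffle.partitions", "200")  # size shuffle parallelism to the workload

joined = large_fact.join(broadcast(small_dim), "dim_id")
joined.explain(mode="formatted")  # the join should appear as a BroadcastHashJoin
```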
Requirements
- 8+ years of hands-on experience in Spark, with proficiency in Python or PySpark.
- Databricks Certified Data Engineer Associate or Professional Certification preferred.
- Strong knowledge of the Databricks platform and previous experience working with it.
- Extensive experience with Apache Spark and a proven history of successful development in this environment.
- Proficiency in at least one programming language (Python, PySpark).
- Previous experience in ETL and data application development, coupled with expertise in version control systems like Git.
- Ability to write PySpark code for data merging and transformation.
- Experience working with APIs for data ingestion and integration.
- Familiarity with Databricks Delta Lake and expertise in query optimization techniques.
- Sound understanding of application development lifecycles and continuous integration/deployment practices.
- Proven experience in query tuning, performance tuning, troubleshooting, and debugging Spark and other big data solutions.
- Solid knowledge of database concepts and SQL.
- Strong background in handling large and complex datasets from various sources and databases.
- Proficient understanding of database engineering and design principles.
- Required Security Clearance: US Citizenship and the ability to obtain and maintain an active Public Trust or higher clearance, per contract requirements.