Tech Stack
Apache, AWS, Azure, Cloud, Docker, Google Cloud Platform, Hadoop, HDFS, Java, Kafka, Kubernetes, MapReduce, Open Source, PySpark, Python, Scala, Spark
About the role
- Responsible for the design and development of Big Data solutions
- Partner with domain experts, product managers, analysts, and data scientists to develop Big Data pipelines in Hadoop or Snowflake
- Responsible for delivering a data-as-a-service framework
- Responsible for moving all legacy workloads to the cloud platform
- Work with data scientists to build Client pipelines using heterogeneous sources and provide engineering services for data science applications
- Ensure automation through CI/CD across platforms, both in the cloud and on-premises
- Research and assess open source technologies and components, and recommend and integrate them into the design and implementation
- Be the technical expert and mentor other team members on Big Data and Cloud Tech stacks
- Define needs around maintainability, testability, performance, security, quality, and usability for the data platform
- Drive implementation, consistent patterns, reusable components, and coding standards for data engineering processes
- Convert SAS-based pipelines to languages such as PySpark or Scala for execution on Hadoop and non-Hadoop ecosystems (see the sketch after this list)
- Tune Big Data applications on Hadoop and non-Hadoop platforms for optimal performance
- Evaluate new IT developments and evolving business requirements and recommend appropriate systems alternatives and/or enhancements to current systems
- Produce detailed analyses of issues and recommend actions
- Supervise day-to-day staff management issues, including resource management, work allocation, and mentoring/coaching
- Appropriately assess risk when business decisions are made and drive compliance with applicable laws, rules and regulations
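For illustration, a minimal sketch of what migrating a SAS-style aggregation (e.g. a PROC SQL GROUP BY) to PySpark might look like. The table paths and column names (transactions, customer_id, amount) are hypothetical placeholders, not details from this role.

```python
# Minimal sketch: a SAS-style GROUP BY aggregation rewritten as a PySpark job.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sas-to-pyspark-example").getOrCreate()

# Read the source data from HDFS (Parquet assumed; the original SAS job might
# read a SAS dataset or a Hive table instead).
transactions = spark.read.parquet("hdfs:///data/raw/transactions")

# Equivalent of a SAS PROC SQL GROUP BY: total and average amount per customer.
summary = (
    transactions
    .filter(F.col("amount") > 0)
    .groupBy("customer_id")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.avg("amount").alias("avg_amount"),
    )
)

# Write the result back to HDFS as Parquet.
summary.write.mode("overwrite").parquet("hdfs:///data/curated/customer_summary")

spark.stop()
```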
Requirements
- 10+ years of total IT experience
- 8+ years of experience with Hadoop (Cloudera)/big data technologies
- Advanced knowledge of the Hadoop ecosystem and Big Data technologies
- Hands-on experience with the Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, Impala, Spark, Kafka, Kudu, Solr)
- Experience designing and developing data pipelines for data ingestion or transformation using Java, Scala, or Python
- Experience with Spark programming (PySpark, Scala, or Java)
- Expert-level experience building pipelines with Apache Spark (see the sketch after this list)
- Familiarity with core provider services from AWS, Azure or GCP, preferably having supported deployments on one or more of these platforms
- Hands-on experience with Python/PySpark/Scala and basic machine learning libraries is required
- Experience with containerization and related technologies (e.g. Docker, Kubernetes)
- Experience with all aspects of DevOps (source control, continuous integration, deployments, etc.)
- 1 year of Hadoop administration experience preferred
- 1+ year of SAS experience preferred
- Proficient in Java or Python programming; prior Apache Beam/Spark experience is a plus
- System-level understanding of data structures, algorithms, and distributed storage and compute
- Team management experience, including leading a team of data engineers and analysts
- Experience with Snowflake or Delta Lake is a plus
- Bachelor’s/University degree or equivalent experience; Master’s degree a plus
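As a companion sketch for the Spark pipeline requirement above, a minimal Structured Streaming job that ingests from Kafka and lands data on HDFS. The broker address, topic name, and output paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Minimal sketch: Spark Structured Streaming ingestion from Kafka to HDFS.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-ingestion-example").getOrCreate()

# Subscribe to a Kafka topic (requires the spark-sql-kafka connector package).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
parsed = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Append each micro-batch to Parquet on HDFS, with checkpointing for fault tolerance.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```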