Tech Stack
Airflow, Amazon Redshift, Apache, AWS, Azure, BigQuery, Cassandra, Cloud, Docker, Google Cloud Platform, Hadoop, HDFS, Java, Kafka, Kubernetes, PySpark, Python, Scala, Spark, SQL, YARN
About the role
- Design, build, maintain, and optimize large-scale data processing systems using Apache Spark (batch and streaming)
- Collaborate with data scientists, analysts, and engineers to deliver scalable, reliable, and efficient data solutions
- Build data pipelines for processing structured, semi-structured, and unstructured data from multiple sources
- Optimize Spark jobs for performance and scalability across large datasets
- Integrate Spark with various data storage systems (HDFS, S3, Hive, Cassandra, etc.)
- Implement data quality checks, monitoring, and alerting for Spark-based workflows (see the sketch after this list)
- Ensure security and compliance of data processing systems
- Troubleshoot and resolve data pipeline and Spark job issues in production environments
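To make the day-to-day concrete, here is a minimal PySpark sketch of the kind of pipeline described above: read raw events from object storage, apply a simple data quality gate, and write partitioned Parquet. The bucket, paths, column names, and the 95% threshold are hypothetical placeholders, not details from this posting.

```python
# Minimal batch pipeline sketch: ingest, cleanse, quality-gate, write.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("events-batch-pipeline")  # hypothetical job name
    .getOrCreate()
)

# Read semi-structured JSON events from a (hypothetical) S3 prefix.
raw = spark.read.json("s3a://example-data-lake/raw/events/")

# Basic cleansing: drop rows missing required keys, derive a partition column.
events = (
    raw
    .dropna(subset=["event_id", "event_ts"])
    .withColumn("event_date", F.to_date(F.col("event_ts")))
)

# Simple data quality gate: fail the job if too many rows were dropped,
# so downstream consumers never see a silently truncated dataset.
raw_count, clean_count = raw.count(), events.count()
if raw_count > 0 and clean_count / raw_count < 0.95:  # threshold is illustrative
    raise RuntimeError(f"Data quality gate failed: kept {clean_count}/{raw_count} rows")

# Write partitioned Parquet back to the lake for downstream consumers.
(
    events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://example-data-lake/curated/events/")
)

spark.stop()
```

Failing fast at the quality gate, rather than logging and continuing, is what makes the monitoring-and-alerting responsibility above actionable in production.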
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field (Master’s preferred)
- 3+ years of hands-on experience with Apache Spark (Core, SQL, Streaming)
- Strong programming skills in Scala, Java, or Python (PySpark)
- Solid understanding of distributed computing concepts and big data ecosystems (Hadoop, YARN, HDFS)
- Experience with data serialization formats (Parquet, ORC, Avro)
- Familiarity with data lake and cloud environments (AWS EMR, Databricks, GCP DataProc, or Azure Synapse)
- Knowledge of SQL required; experience with data warehouses (Snowflake, Redshift, BigQuery) is a plus
- Strong background in performance tuning and Spark job optimization (an illustrative sketch follows this list)
- Experience with CI/CD pipelines and version control (Git)
- Familiarity with containerization (Docker, Kubernetes) is an advantage
- Preferred: Experience with stream processing frameworks (Kafka, Flink)
- Preferred: Exposure to machine learning workflows with Spark MLlib
- Preferred: Knowledge of workflow orchestration tools (Airflow, Luigi)
- Ability to safely and successfully perform the essential job functions (sedentary work)
- Ability to conduct repetitive tasks on a computer, utilizing a mouse, keyboard, and monitor
- Remote work
- Reasonable accommodation for applicants (application.accommodations@cai.io)
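Since the requirements call out performance tuning and Spark job optimization, here is an illustrative sketch of two common tuning moves: broadcasting a small dimension table so the join avoids shuffling the large side, and coalescing before the write to avoid producing many tiny files. Table paths, the join column, and the partition count are hypothetical, not from this posting.

```python
# Tuning sketch: broadcast join plus output-partition control.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-example").getOrCreate()

facts = spark.read.parquet("s3a://example-data-lake/curated/events/")    # large table
dims = spark.read.parquet("s3a://example-data-lake/curated/countries/")  # small table

# Broadcast the small table so the join runs map-side on each executor,
# avoiding a full shuffle of the large fact table.
joined = facts.join(broadcast(dims), on="country_code", how="left")

# Coalesce before writing so the output is a manageable number of files
# instead of one small file per shuffle partition.
joined.coalesce(64).write.mode("overwrite").parquet(
    "s3a://example-data-lake/curated/events_enriched/"
)

spark.stop()
```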
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Apache Spark, Scala, Java, Python, Hadoop, YARN, HDFS, Parquet, ORC, Avro
Soft skills
collaboration, troubleshooting, problem-solving, performance tuning, data quality checks
Certifications
Bachelor’s degree in Computer Science, Master’s degree in Computer Science