Senior Data Engineer

Leega

Architect and evolve the data lake that powers dynamic pricing and machine learning at Leega. Ensure data governance, quality, and responsiveness in a multi-tenant Lakehouse architecture.

Posted 6/10/2026 • Full-time • Remote • 🇧🇷 Brazil • Senior

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

SQL, Python, PySpark, Apache Iceberg, Kafka, Flink, dbt, Airflow, CDC, OLAP

Soft Skills

data governance, data quality, data modeling, ownership, reliability, cost efficiency

Tools & Technologies

S3, Schema Registry, Cube.js, ClickHouse, Pinot, Trino, OpenMetadata, Lake Formation, Qdrant, AI-assisted development

Industry Keywords

Lakehouse architecture, real-time ingestion, data lineage, data protection law, semantic layers, Data Mesh, federated queries, controlled schema evolution, metric definitions, backfill

Tech Stack

Tools & technologies

Airflow, Apache, JavaScript, Kafka, PySpark, Python, SQL

About the role

Key responsibilities & impact

You will architect and evolve the data lake that serves as the company's data nervous system: the foundation that feeds, in real time, the dynamic pricing engine, ML models, and the group's business intelligence.
This is an ownership role: you define the multi-tenant Lakehouse architecture, from streaming to the semantic layer, and are responsible for its reliability, governance, and cost.
Design and evolve the data lake on Apache Iceberg over S3: well-defined layers, partitioning and compaction, time travel, and DELETE/UPDATE support for compliance with LGPD (the Brazilian data protection law).
Build real-time ingestion (Kafka, Flink, CDC with Debezium) with controlled schema evolution (Schema Registry) and delivery guarantees.
Model the transformation layer in dbt and orchestrate batch and quality flows in Airflow, from crawler to backfill.
Maintain metric definitions in Cube.js — the single source that feeds BI and AI agents and ensures consistency across the company.
Operate federated, low-latency OLAP queries over the lake, with per-tenant cost and access isolation, while keeping queries performant.
Ensure data testing, lineage and cost efficiency, keeping the platform reliable as it scales.

Requirements

What you’ll need

Strong command of SQL and query optimization in distributed environments (minimum 5 years).
Python with solid experience in PySpark or distributed processing.
Orchestration (Airflow), ELT, and dbt applied at scale (minimum 4 years).
Streaming (Kafka, Flink) and Lakehouse architectures with Apache Iceberg (minimum 3 years).
Strong understanding of data governance, quality, and modeling.
Comfortable with AI-assisted development (e.g., Claude Code).
CDC (Debezium) and low-latency OLAP (ClickHouse, Pinot, Trino/Athena).
Semantic layers (Cube.js, dbt) and Data Mesh architectures.
Governance and catalog tools (OpenMetadata, Lake Formation).
Vector databases (Qdrant) and data pipelines for ML.

Benefits

Comp & perks

Remote work
Project duration: 6 months, with possibility of extension or conversion to permanent employment.