Engineering Member – Pre-training, Data Research

poolside

Data role focused on improving dataset quality for AI model training at Poolside. Collaborate with teams to ensure high-quality datasets for large training volumes.

Posted 5/19/2026 • Full-time • Remote • 🇪🇺 Anywhere in Europe • Mid-Level, Senior

Tech Stack

Tools & technologies

Python

About the role

Key responsibilities & impact

You’ll be working on our data team focused on the quality of the datasets being delivered for training our models.
This is a hands-on role where your #1 mission will be to improve the quality of the pretraining datasets by leveraging your previous experience, intuition, and training experiments.
This includes synthetic data generation and data mix optimization.
You’ll collaborate closely with other teams like Pretraining, Post-training, Evals, and Product to define high-quality data needs that map to missing model capabilities and downstream use cases.
Staying in sync with the latest research in the fields of dataset design and pretraining is key to success in this role.
You will continually lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.

Requirements

What you’ll need

Strong machine learning and engineering background
Experience with Large Language Models (LLMs), including:
Understanding of transformer architectures and how LLMs learn
Data ablations and scaling laws
Mid-training and post-training techniques
Training reasoning and agentic models
Experience with evals that track model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
Experience in building trillion-scale pretraining datasets, and familiarity with concepts like data curation, deduplication, data mixing, tokenization, curriculum, impact of data repetition, etc.
Excellent programming skills in Python
Strong prompt engineering skills
Experience working with large-scale GPU clusters and distributed data pipelines
Strong obsession with data quality
Research experience:
Authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a nice-to-have
Can freely discuss the latest papers and go into fine detail
Is reasonably opinionated

Benefits

Comp & perks

Fully remote work & flexible hours
37 days/year of vacation & holidays
Health insurance allowance for you & dependents
Company-provided equipment
Well-being, always-be-learning & home office allowances
Frequent team get-togethers
Diverse & inclusive people-first culture
