Tech Stack
Tools & technologies
- Python
About the role
Key responsibilities & impact
- You’ll be working on our data team, focused on the quality of the datasets delivered for training our models.
- This is a hands-on role where your #1 mission is to improve the quality of the pretraining datasets by leveraging your previous experience, intuition, and training experiments.
- This includes synthetic data generation and data mix optimization.
- You’ll collaborate closely with other teams like Pretraining, Post-training, Evals, and Product to define high-quality data needs that map to missing model capabilities and downstream use cases.
- Staying in sync with the latest research in the fields of dataset design and pretraining is key to success in this role.
- You will constantly lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
- Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.
Requirements
What you’ll need
- Strong machine learning and engineering background
- Experience with Large Language Models (LLMs), including:
  - Understanding of transformer architectures and how LLMs learn
  - Data ablations and scaling laws
  - Mid-training and post-training techniques
  - Training reasoning and agentic models
- Experience with evals tracking model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
- Experience in building trillion-scale pretraining datasets, and familiarity with concepts like data curation, deduplication, data mixing, tokenization, curriculum, impact of data repetition, etc.
- Excellent programming skills in Python
- Strong prompt engineering skills
- Experience working with large-scale GPU clusters and distributed data pipelines
- Strong obsession with data quality
- Research experience:
  - Authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a nice-to-have
  - Can freely discuss the latest papers and go into fine detail
  - Is reasonably opinionated
Benefits
Comp & perks
- Fully remote work & flexible hours
- 37 days/year of vacation & holidays
- Health insurance allowance for you & dependents
- Company-provided equipment
- Well-being, always-be-learning & home office allowances
- Frequent team get-togethers
- Diverse & inclusive people-first culture
