poolside

Member of Engineering – Pre-training, Data Acquisition

Data Acquisition Engineer focused on web crawling for pre-training data collection. Working with a distributed team to build AI infrastructure for software development.

Posted 5/18/2026 • full-time • Remote • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies
AWS, Cloud, Distributed Systems, Docker, Kubernetes, Python

About the role

Key responsibilities & impact
  • Design, build, and operate a large-scale web crawler responsible for acquiring all openly accessible data on the internet
  • Develop specialized deep crawlers targeting high-value sources to improve recall and coverage
  • Own a long-term roadmap for data acquisition in collaboration with data researchers
  • Build observability, monitoring, and debugging tooling to ensure reliability and transparency across crawl infrastructure
  • Collaborate with pre-training, post-training, and evaluations teams to align data acquisition priorities with model training needs
  • Build high-throughput ingestion pipelines for rapidly onboarding partner data and evaluating it for quality

Requirements

What you’ll need
  • Strong distributed systems background with proven experience building and operating large-scale infrastructure — data pipelines, web crawlers, or similar
  • Proficiency in Python and comfort with optimizing performance and debugging complex systems under production conditions
  • Hands-on experience with web crawling or large-scale data extraction: understanding of HTTP protocols, distributed job queues, and data parsing at scale
  • Familiarity with cloud platforms (AWS) and container orchestration (Kubernetes, Docker) for deploying and managing high-throughput workloads
  • Awareness of the non-technical dimensions of internet-scale crawling: data privacy, robots.txt adherence, and responsible crawl practices

Nice to have
  • Prior experience pre-training LLMs
  • Experience in building trillion-scale SOTA pre-training datasets
  • Experience translating research to production at scale

Benefits

Comp & perks
  • Fully remote work & flexible hours
  • 37 days/year of vacation & holidays
  • 16 weeks of flexible, full-pay parental leave
  • Health insurance allowance for you & dependents
  • Company-provided equipment
  • Well-being, always-be-learning & home office allowances
  • Frequent team get-togethers
  • Diverse & inclusive people-first culture

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, distributed systems, web crawling, data extraction, data pipelines, data parsing, high-throughput ingestion, performance optimization, debugging, building SOTA pre-training datasets
Soft Skills
collaboration, communication, organizational skills, problem-solving, transparency