Tech Stack
Tools & technologies: Cloud, Distributed Systems, Go, Java, JavaScript, Puppeteer, Python, Rust
About the role
Key responsibilities & impact
- Build and maintain large-scale web crawlers across diverse domains.
- Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
- Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
- Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
- Build and maintain datasets specifically structured for research and machine learning model training.
- Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
- Collaborate with research teams to ensure data collection efforts align with modeling requirements.
- Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.
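To give a flavor of the frontier, deduplication, and politeness concerns listed above, here is a minimal, illustrative sketch in Python. All names are hypothetical; a production system at this scale would replace the in-memory set with a Bloom filter or key-value store and the single queue with a distributed one.

```python
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Toy URL frontier: exact-match dedup plus a per-host politeness delay."""

    def __init__(self, delay_s: float = 1.0):
        self.queue: deque = deque()
        self.seen: set = set()        # at real scale: Bloom filter / external store
        self.next_ok: dict = {}       # host -> earliest allowed next-fetch time
        self.delay_s = delay_s

    def add(self, url: str) -> bool:
        """Enqueue a URL unless it was already seen. Returns True if enqueued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self, now=None):
        """Return the next URL whose host is past its politeness delay, else None."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlparse(url).netloc
            if self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay_s
                return url
            self.queue.append(url)  # host still cooling down; rotate to the back
        return None
```

The `now` parameter exists so the scheduling logic is testable without real waiting; a real fetch loop would simply call `pop()`.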
Requirements
What you’ll need
- Extensive programming experience in one or more of the following: Go, Rust, Python, Java, or C++.
- Proven experience in building web crawlers or large-scale data pipelines.
- Solid understanding of HTTP, networking protocols, and browser behavior.
- Familiarity with distributed systems and parallel processing techniques.
- Experience handling large datasets, ideally at the terabyte to petabyte scale.
- Demonstrated ability to debug and maintain systems within unstable or adversarial environments.
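One small facet of the large-dataset work above, exact-duplicate removal via content hashing, can be sketched as follows. This is a toy assuming normalization means case-folding and whitespace collapsing; real pipelines typically add near-duplicate detection (e.g., MinHash or SimHash).

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash a document body after trivial normalization (lowercase, collapse whitespace)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(pages):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen = set()
    kept = []
    for page in pages:
        fp = content_fingerprint(page)
        if fp not in seen:
            seen.add(fp)
            kept.append(page)
    return kept
```

At terabyte-to-petabyte scale the `seen` set would live in a sharded external store rather than memory, but the fingerprint-then-filter shape is the same.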
Preferred Qualifications:
- Experience with NLP pipelines or dataset curation for machine learning.
- Familiarity with LLM pre-training data or retrieval systems.
- Practical experience with headless browsers (e.g., Playwright, Puppeteer, or Chrome DevTools Protocol).
- Knowledge of proxy systems, IP rotation, and large-scale request orchestration.
- Background in data quality evaluation or benchmarking.
- Experience running workloads on cloud or bare-metal infrastructure.
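As a rough illustration of the proxy-rotation and request-orchestration experience mentioned above, here is a hypothetical round-robin proxy pool with failure eviction. The class and thresholds are illustrative only, not part of the role or any specific library.

```python
class ProxyPool:
    """Toy round-robin proxy rotator that evicts proxies after repeated failures."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._i = 0
        self._failures = {}  # proxy -> consecutive failure count

    def next(self) -> str:
        """Return the next proxy in round-robin order (assumes pool is non-empty)."""
        proxy = self._proxies[self._i % len(self._proxies)]
        self._i += 1
        return proxy

    def report_failure(self, proxy: str, max_failures: int = 3) -> None:
        """Count a failed request; drop the proxy once it exceeds the threshold."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1
        if self._failures[proxy] >= max_failures and proxy in self._proxies:
            self._proxies.remove(proxy)
```

A real orchestrator would also track per-proxy latency and success rates and re-probe evicted proxies, but eviction-on-failure is the core loop.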
Benefits
Comp & perks
- Impactful Opportunity: Contribute to the development of a web-scale crawler and knowledge graph at the forefront of AI data accessibility.
- High-Performance Culture: Join a lean, low-ego team that prioritizes high output and professional growth.
- Remote Work: This position is part of a fully remote team, offering flexibility and autonomy.
- Competitive Compensation: Salary, comprehensive benefits, and equity, commensurate with experience and the ability to operate at scale.
