Tech Stack
Tools & technologies: Cloud, Distributed Systems, Go, Java, JavaScript, Puppeteer, Python, Rust
About the role
Key responsibilities & impact
- Build and maintain large-scale web crawlers across diverse domains.
- Design high-throughput, fault-tolerant systems for data collection, managing volumes ranging from millions to billions of URLs per day.
- Navigate anti-bot systems, rate limits, and dynamic, JavaScript-heavy websites.
- Develop robust pipelines for data cleaning, deduplication, filtering, and normalization.
- Build and maintain datasets specifically structured for research and machine learning model training.
- Monitor and optimize crawl performance, coverage, and data quality through rapid iteration.
- Collaborate with research teams to ensure data collection efforts align with modeling requirements.
- Optimize infrastructure to ensure cost-efficiency, low latency, and reliability.
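To give a flavor of the frontier, deduplication, and politeness concerns listed above, here is a minimal, illustrative sketch in Python. All names are hypothetical; a production system at this scale would replace the in-memory set with a Bloom filter or key-value store and the single queue with a distributed one.

```python
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Toy URL frontier: exact-match dedup plus a per-host politeness delay."""

    def __init__(self, delay_s: float = 1.0):
        self.queue: deque = deque()
        self.seen: set = set()        # at real scale: Bloom filter / external store
        self.next_ok: dict = {}       # host -> earliest allowed next-fetch time
        self.delay_s = delay_s

    def add(self, url: str) -> bool:
        """Enqueue a URL unless it was already seen. Returns True if enqueued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self, now=None):
        """Return the next URL whose host is past its politeness delay, else None."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlparse(url).netloc
            if self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay_s
                return url
            self.queue.append(url)  # host still cooling down; rotate to the back
        return None
```

The `now` parameter exists so the scheduling logic is testable without real waiting; a real fetch loop would simply call `pop()`.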
Requirements
What you’ll need
- Extensive programming experience in one or more of the following: Go, Rust, Python, Java, or C++.
- Proven experience in building web crawlers or large-scale data pipelines.
- Solid understanding of HTTP, networking protocols, and browser behavior.
- Familiarity with distributed systems and parallel processing techniques.
- Experience handling large datasets, ideally at the terabyte to petabyte scale.
- Demonstrated ability to debug and maintain systems within unstable or adversarial environments.
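One small facet of the large-dataset work above, exact-duplicate removal via content hashing, can be sketched as follows. This is a toy assuming normalization means case-folding and whitespace collapsing; real pipelines typically add near-duplicate detection (e.g., MinHash or SimHash).

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash a document body after trivial normalization (lowercase, collapse whitespace)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(pages):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen = set()
    kept = []
    for page in pages:
        fp = content_fingerprint(page)
        if fp not in seen:
            seen.add(fp)
            kept.append(page)
    return kept
```

At terabyte-to-petabyte scale the `seen` set would live in a sharded external store rather than memory, but the fingerprint-then-filter shape is the same.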
Preferred Qualifications:
- Experience with NLP pipelines or dataset curation for machine learning.
- Familiarity with LLM pre-training data or retrieval systems.
- Practical experience with headless browsers (e.g., Playwright, Puppeteer, or Chrome DevTools Protocol).
- Knowledge of proxy systems, IP rotation, and large-scale request orchestration.
- Background in data quality evaluation or benchmarking.
- Experience running workloads on cloud or bare-metal infrastructure.
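As a rough illustration of the proxy-rotation and request-orchestration experience mentioned above, here is a hypothetical round-robin proxy pool with failure eviction. The class and thresholds are illustrative only, not part of the role or any specific library.

```python
class ProxyPool:
    """Toy round-robin proxy rotator that evicts proxies after repeated failures."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._i = 0
        self._failures = {}  # proxy -> consecutive failure count

    def next(self) -> str:
        """Return the next proxy in round-robin order (assumes pool is non-empty)."""
        proxy = self._proxies[self._i % len(self._proxies)]
        self._i += 1
        return proxy

    def report_failure(self, proxy: str, max_failures: int = 3) -> None:
        """Count a failed request; drop the proxy once it exceeds the threshold."""
        self._failures[proxy] = self._failures.get(proxy, 0) + 1
        if self._failures[proxy] >= max_failures and proxy in self._proxies:
            self._proxies.remove(proxy)
```

A real orchestrator would also track per-proxy latency and success rates and re-probe evicted proxies, but eviction-on-failure is the core loop.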
Benefits
Comp & perks
- Impactful Opportunity: Contribute to the development of a web-scale crawler and knowledge graph at the forefront of AI data accessibility.
- High-Performance Culture: Join a lean, low-ego team that prioritizes high output and professional growth.
- Remote Work: This position is part of a fully remote team, offering flexibility and autonomy.
- Competitive Compensation: Salary, comprehensive benefits, and equity, commensurate with experience and the ability to operate at scale.
