Collect, clean, and preprocess user-generated text and image data for fine-tuning large models
Design and manage scalable data labeling pipelines, leveraging both crowdsourcing and in-house labeling teams
Build and maintain automated datasets for content moderation (e.g., safe vs. unsafe content)
Collaborate with researchers and engineers to ensure datasets are high-quality, diverse, and aligned with model training needs
Requirements
Proven experience preparing datasets for machine learning or fine-tuning large models
Strong skills in data cleaning, preprocessing, and transformation for both text and image data
Hands-on experience with data labeling workflows and quality assurance for labeled data
Familiarity with building and maintaining moderation datasets (safety, compliance, and filtering)
Proficiency in scripting (Python, SQL) and working with large-scale data pipelines
Benefits
Flat structure & real ownership
Full involvement in direction and consensus decision-making
Flexibility in work arrangement
High-impact role with visibility across product, data, and engineering
Top-of-market compensation and performance-based bonuses
Global exposure to product development
Lots of perks: housing rental subsidies, a quality company cafeteria, and overtime meals
Health, dental & vision insurance
Global travel insurance (for you & your dependents)
Unlimited, flexible time off