Tech Stack
Tools & technologies: Cloud, Python, PyTorch
About the role
Key responsibilities & impact
- Design and implement pipelines that analyze real documents to inform high-fidelity synthetic data generation
- Build generative systems capable of producing documents across diverse formats, layouts, and domains
- Develop evaluation frameworks to ensure synthetic data maintains distributional fidelity and diversity
- Research and apply generative modeling techniques suited for document AI training
- Identify and mitigate quality issues to ensure synthetic data is effective for downstream model training
- Partner with Modeling teams to measure the impact of synthetic data on model performance
- Own the synthetic data generation track end-to-end, from architecture to quality validation
- Drive architectural decisions balancing quality, diversity, scale, and cost efficiency
- Define and maintain data quality metrics and generation dashboards
- Collaborate closely with annotation teams to ensure compatibility with downstream pipelines
- Contribute to roadmap planning alongside Principal-level leadership
- Build scalable pipelines capable of generating millions of synthetic training examples
- Implement post-processing, filtering, and validation mechanisms to remove low-quality outputs
- Design cost-efficient workflows balancing compute, quality, and throughput
- Develop monitoring systems to detect distribution shifts or quality degradation over time
- Collaborate with Platform teams on compute orchestration, storage, and scheduling
Requirements
What you’ll need
- MS or PhD in Computer Science, Engineering, Mathematics, or a related field
- 5+ years of experience in Machine Learning / AI, with a focus on:
  - Generative models
  - Vision-Language Models (VLMs)
  - Synthetic data systems
- Proven experience building and evaluating synthetic data pipelines for ML training
- Strong background in data quality evaluation and statistical analysis
- Deep expertise in Vision-Language Models and document understanding (layout, structure, semantics)
- Strong knowledge of generative modeling for structured and semi-structured data
- Understanding of what makes synthetic data valuable:
  - Distributional fidelity
  - Diversity
  - Realistic noise patterns
  - Domain coverage
- Strong programming skills in Python with experience in PyTorch or similar frameworks
- Experience evaluating data quality via automated metrics and downstream model impact
- Familiarity with large-scale data pipelines, cloud environments, and experiment tracking
- Proven ability to independently own complex technical workstreams
- Strong collaboration across data, modeling, and platform teams
- Ability to clearly communicate data quality and generation trade-offs
- Data-driven mindset with strong attention to coverage gaps and quality signals
Benefits
Comp & perks
- Comprehensive medical, accidental, and life insurance
- Weekly wellness sessions to support your physical and mental well-being
- A generous paid time off policy
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
generative models, Vision-Language Models, synthetic data systems, data quality evaluation, statistical analysis, Python, PyTorch, data pipelines, cloud environments, automated metrics
Soft Skills
collaboration, communication, data-driven mindset, attention to detail, independent ownership, roadmap planning, architectural decision-making, quality validation, problem-solving, impact measurement
Certifications
MS in Computer Science, PhD in Computer Science, Engineering, Mathematics
