Senior Machine Learning Engineer, Synthetic Data, Document Understanding

ABBYY

Design and implement pipelines that analyze real documents to inform high-fidelity synthetic data generation.

Posted 5/21/2026 • Full-time • Bangalore • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies

Cloud, Python, PyTorch

About the role

Key responsibilities & impact

Design and implement pipelines that analyze real documents to inform high-fidelity synthetic data generation
Build generative systems capable of producing documents across diverse formats, layouts, and domains
Develop evaluation frameworks to ensure synthetic data maintains distributional fidelity and diversity
Research and apply generative modeling techniques suited for document AI training
Identify and mitigate quality issues to ensure synthetic data is effective for downstream model training
Partner with Modeling teams to measure the impact of synthetic data on model performance
Own the synthetic data generation track end-to-end, from architecture to quality validation
Drive architectural decisions balancing quality, diversity, scale, and cost efficiency
Define and maintain data quality metrics and generation dashboards
Collaborate closely with annotation teams to ensure compatibility with downstream pipelines
Contribute to roadmap planning alongside Principal-level leadership
Build scalable pipelines capable of generating millions of synthetic training examples
Implement post-processing, filtering, and validation mechanisms to remove low-quality outputs
Design cost-efficient workflows balancing compute, quality, and throughput
Develop monitoring systems to detect distribution shifts or quality degradation over time
Collaborate with Platform teams on compute orchestration, storage, and scheduling

Requirements

What you’ll need

MS or PhD in Computer Science, Engineering, Mathematics, or related field
5+ years of experience in Machine Learning / AI, with a focus on generative models, Vision-Language Models (VLMs), and synthetic data systems
Proven experience building and evaluating synthetic data pipelines for ML training
Strong background in data quality evaluation and statistical analysis
Deep expertise in Vision-Language Models and document understanding (layout, structure, semantics)
Strong knowledge of generative modeling for structured and semi-structured data
Understanding of what makes synthetic data valuable: distributional fidelity, diversity, realistic noise patterns, and domain coverage
Strong programming skills in Python with experience in PyTorch or similar frameworks
Experience evaluating data quality via automated metrics and downstream model impact
Familiarity with large-scale data pipelines, cloud environments, and experiment tracking
Proven ability to independently own complex technical workstreams
Strong collaboration across data, modeling, and platform teams
Ability to clearly communicate data quality and generation trade-offs
Data-driven mindset with strong attention to coverage gaps and quality signals

Benefits

Comp & perks

Comprehensive medical, accidental, and life insurance
Weekly wellness sessions to support your physical and mental well-being
A generous paid time off policy

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

generative models, Vision-Language Models, synthetic data systems, data quality evaluation, statistical analysis, Python, PyTorch, data pipelines, cloud environments, automated metrics

Soft Skills

collaboration, communication, data-driven mindset, attention to detail, independent ownership, roadmap planning, architectural decision-making, quality validation, problem-solving, impact measurement

Certifications

MS in Computer Science, PhD in Computer Science, Engineering, Mathematics