Tech Stack
NumPy, Pandas, Python, PyTorch, Scikit-learn, Spark, SQL
About the role
- Ingesting, organizing, and maintaining large-scale training datasets from open-source resources and contract-specific artifacts
- Creating and managing data cataloging systems to ensure datasets are findable, accessible, and ready for ML training pipelines
- Designing and implementing data labeling workflows, including managing external labeling vendors and quality assurance processes
- Building and maintaining YOLO-style manifests and annotation formats for custom computer vision datasets
- Performing data cleaning, validation, and augmentation to ensure high-quality training data
- Conducting exploratory data analysis and generating insights about dataset characteristics, biases, and coverage gaps
- Supporting the ML research team with statistical analysis, experiment design, and model evaluation
- Developing data pipelines and automation tools for continuous data ingestion and processing
- Collaborating with ML engineers to optimize data loading and preprocessing for training efficiency
- Processing incoming datasets from various sources, performing quality checks, and organizing them into our data management system
- Creating or reviewing annotation schemas and coordinating with labeling teams to ensure consistent, high-quality labels
- Writing Python scripts to clean, transform, and validate datasets for specific ML training requirements
- Analyzing dataset statistics and creating visualizations to identify potential issues or opportunities for improvement
- Collaborating with the ML research lead to design experiments and evaluate model performance across different data splits
- Documenting dataset characteristics, versioning, and lineage to maintain reproducibility and compliance
Requirements
- High standard of ethics, grit, integrity, and moral character
- 5+ years of experience in data science, analytics, or a related field, with a focus on ML data preparation
- Strong foundation in probability, statistics, and experimental design
- Bachelor's degree in Statistics, Mathematics, Computer Science, or related quantitative field (Master's preferred)
- Proficiency with Python data stack: Pandas, NumPy, Jupyter Notebooks, and data visualization libraries
- Experience with ML frameworks (PyTorch, Scikit-learn) and familiarity with training workflows
- Hands-on experience with computer vision datasets and annotation formats (COCO, YOLO, Pascal VOC)
- Experience managing data labeling projects and working with annotation tools (Label Studio, CVAT, or similar)
- Familiarity with open-source ML models and experience applying them to real-world problems
- Strong SQL skills and experience with data warehousing concepts
- Experience with version control (Git) and collaborative development practices
- Excellent communication skills for coordinating with technical and non-technical stakeholders
- Meticulous attention to detail and strong organizational skills for managing complex datasets
- Willingness to embrace the Startup Culture of moving fast, being insatiably curious, celebrating often, embracing uncertainty, and having a personal desire to improve other people's lives
- Must be eligible to obtain a clearance with the U.S. government