
Model Evaluation QA Lead
Deepgram
full-time
Location Type: Remote
Location: United States
Salary
💰 $180,000 - $230,000 per year
About the role
- Model Evaluation Automation: Design, build, and maintain automated model evaluation pipelines that run against every candidate model before release. Implement objective and subjective quality metrics (WER, SER, MOS, latency/throughput) across STT, TTS, and STS product lines (a minimal WER sketch follows this list).
- Release Gate Integration: Embed model quality checkpoints into CI/CD and release pipelines. Define pass/fail criteria, build dashboards for model comparison, and own the go/no-go signal for model promotions to production (see the release-gate sketch after this list).
- Agent & Model Evaluation Frameworks: Stand up and operate evaluation tooling (Coval, Braintrust, Blue Jay, custom harnesses) for end-to-end voice agent testing, covering accuracy, latency, turn-taking, conversational quality, and custom metrics across real-world scenarios.
- Active Learning & Data Ingestion Testing: Partner with the Active Learning team to validate data ingestion infrastructure, annotation pipelines, and retraining automation. Ensure data quality standards are met at every stage of the flywheel.
- Industry Benchmark Automation: Automate execution and reporting of industry-standard benchmarks (e.g., LibriSpeech, CommonVoice, internal production-traffic evals). Maintain reproducible benchmark environments and publish results for internal consumption.
- Language & Domain Validation: Build and maintain test suites for multi-language and domain-specific model validation. Design coverage matrices that ensure new languages and acoustic domains are systematically evaluated before GA (see the coverage-matrix sketch after this list).
- Retraining Automation Support: Validate the end-to-end retraining pipeline across all data sources—from data selection and preprocessing through training, evaluation, and promotion—ensuring automation reliability and correctness.
- Manual Test Feedback Loop: Design and operate human-in-the-loop evaluation workflows for subjective quality assessment. Build the tooling and processes that translate human feedback into actionable quality signals for the ML team (see the MOS aggregation sketch after this list).
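
To make the objective metrics above concrete: word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, normalized by reference length. The sketch below is illustrative only, not Deepgram's tooling; the function and variable names are assumptions.

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(ref), len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words, ~0.167
```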
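
A release gate of the kind described above can reduce to a script that compares a candidate's metrics against agreed thresholds and exits nonzero on any regression, which fails the CI/CD promotion step. The metric names and thresholds here are hypothetical, a sketch of the pattern rather than the actual gate.

```python
import sys

# Hypothetical pass/fail thresholds for promoting a candidate model.
THRESHOLDS = {"wer": 0.08, "ser": 0.15, "p95_latency_ms": 300.0}

def gate(candidate: dict) -> bool:
    """Return True only if every tracked metric clears its release threshold."""
    failures = {m: v for m, v in candidate.items()
                if m in THRESHOLDS and v > THRESHOLDS[m]}
    for metric, value in failures.items():
        print(f"FAIL {metric}: {value:.3f} > {THRESHOLDS[metric]:.3f}")
    return not failures

if __name__ == "__main__":
    candidate = {"wer": 0.072, "ser": 0.141, "p95_latency_ms": 288.0}
    sys.exit(0 if gate(candidate) else 1)  # nonzero exit blocks the promotion
```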
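
A coverage matrix for language and domain validation can be as simple as pivoting per-run results and inspecting the empty cells. The data and column names in this sketch are invented for illustration:

```python
import pandas as pd

# Hypothetical evaluation results: one row per (language, domain) test run.
runs = pd.DataFrame({
    "language": ["en", "en", "es", "es", "de"],
    "domain":   ["medical", "call-center", "medical", "call-center", "medical"],
    "wer":      [0.06, 0.09, 0.08, 0.11, 0.07],
})

# Pivot into a coverage matrix; NaN cells flag untested combinations before GA.
matrix = runs.pivot_table(index="language", columns="domain", values="wer")
print(matrix)
print("Untested combinations:", int(matrix.isna().sum().sum()))  # de/call-center
```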
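
Finally, the human-in-the-loop workflow typically ends in an aggregation step: mean opinion score (MOS) is the mean of listener ratings on a 1-5 scale, usually reported with a confidence interval. A minimal sketch, assuming ratings arrive as a flat array:

```python
import numpy as np

# Hypothetical listener ratings (1-5 scale) for one batch of TTS samples.
ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4, 4, 5])

mos = ratings.mean()
# 95% confidence interval via a normal approximation (illustrative only).
ci = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```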
Requirements
- 4–7 years of experience in QA engineering, ML evaluation, or a related technical role, with a focus on the quality of predictive and generative models and the data that feeds them.
- Hands-on experience building automated test/evaluation pipelines for ML models and for the software features that integrate them.
- Strong programming skills in Python; experience with ML evaluation libraries, data processing frameworks (Pandas, NumPy), and scripting for pipeline automation.
- Familiarity with speech/audio ML concepts: WER, SER, MOS, acoustic models, language models, or similar evaluation metrics.
- Experience with CI/CD integration for ML workflows (e.g., GitHub Actions, Jenkins, Argo, MLflow, or equivalent).
- Ability to design and maintain reproducible benchmark environments across multiple model versions and configurations.
- Strong communication skills—you can translate model quality metrics into actionable insights for engineering, research, and product stakeholders.
- Detail-oriented and systematic, with a bias toward automation over manual process.
Benefits
- Medical, dental, vision benefits
- Annual wellness stipend
- Mental health support
- Life, short-term disability (STD), and long-term disability (LTD) income insurance plans
- Unlimited PTO
- Generous paid parental leave
- Flexible schedule
- 12 Paid US company holidays
- Quarterly personal productivity stipend
- One-time stipend for home office upgrades
- 401(k) plan with company match
- Tax Savings Programs
- Learning / Education stipend
- Participation in talks and conferences
- Employee Resource Groups
- AI enablement workshops / sessions
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
QA engineering, ML evaluation, automated test pipelines, Python, Pandas, NumPy, CI/CD integration, model quality metrics, benchmark environments, retraining automation
Soft Skills
strong communication, detail-oriented, systematic, bias toward automation