Salary
💰 $18 - $24 per hour
Tech Stack
BigQuery, Cloud, ETL, Pandas, Python, SQL
About the role
- Design and run agentic experiments end to end: frame problems, define success criteria, and summarize results with recommendations
- Script data pulls from BigQuery, assemble representative datasets, and write robust Python for data processing and experiment automation
- Integrate and iterate prompts in code; execute runs, collect outputs, and perform cost/quality analysis
- Evaluate outputs with AI and programmatic checks, including error detection, terminology/style adherence, and human-in-the-loop checkpoints
- Partner with Production, Product, and Research on workflow trials; quantify tradeoffs in quality, speed, and cost
- Communicate findings in docs and presentations; open/update Jira tickets and share reproducible artifacts (datasets, scripts, prompts, dashboards)
- Contribute to prompt and experiment hygiene: versioning, datasets, eval suites, and guardrails
- Test agent capabilities in a sandbox and provide structured feedback to Product/Engineering
Requirements
- 2–4 years of experience in Data Science, Analytics, or Applied AI with demonstrable Python proficiency (pandas, data parsing, APIs, basic ETL)
- Hands-on experience with LLMs and prompt engineering across providers (e.g., OpenAI, Anthropic, Vertex/Gemini, Bedrock), including practical eval and iteration cycles
- Strong analytical rigor: can define success metrics, compare workflows, and reason clearly about quality/cost/speed tradeoffs in production settings
- SQL and data wrangling skills; experience with BigQuery or equivalent cloud data warehouse
- Clear written communication with exec-ready summaries and artifact links (reports, notebooks, Sheets, slides)
- Experience evaluating LLM systems with AI judges, scripted checks, or human sampling; familiarity with MQM/LQE or similar linguistic QA frameworks (nice-to-have)
- Knowledge of RAG, vector stores, and retrieval verification strategies (nice-to-have)
- Familiarity with agentic workflows or translation/linguistics domains (nice-to-have)
- Familiarity with basic MLOps/experimentation tools and prompt versioning best practices (nice-to-have)