
Research Engineer – Evaluations, Applied AI
LILT
contract
Location Type: Remote
Location: Argentina
About the role
- Eval Architecture & Benchmarking: Design and implement automated and human-in-the-loop evaluation frameworks to measure model performance across multiple modalities (text, code, image, etc.).
- Calibration & Peer Review: Act as the Gold Standard reviewer for other engineers. You will calibrate their data generation and evaluation contributions, providing technical feedback to ensure scientific consistency and high-fidelity output.
- Frontier Sample Generation: Write and refine complex prompts and golden response pairs for frontier-model training, specifically focusing on edge cases in reasoning and multilingual contexts.
- Quality Control (End-to-End): Develop the logic for multi-modal QC checks, ensuring that high-volume data samples are correct across diverse domains and languages.
- Technical Mentorship: Bring new knowledge and best practices on model evaluations to our established delivery and forward-deployed engineering teams.
Requirements
- Education: B.S. in Computer Science, AI, or a related field, or 5+ years of relevant experience in a high-growth AI/Research environment.
- Deep Technical Proficiency: Expert-level Python skills and hands-on experience with modern AI frameworks (PyTorch, Transformers, LangChain/LlamaIndex).
- Evaluation Experience: Experience building model evaluation suites (e.g., MMLU-style benchmarks, custom RAG metrics, or human-preference alignment).
- Domain Expertise: Deep understanding of RAG architectures, vector database retrieval logic, and agentic workflows. Experience with RLHF/RLAIF environments and the mechanics of preference signaling/reward modeling.
- Multimodal & Multilingual Rigor: Experience handling data quality at scale across different languages and modalities (images, video, or audio).
- Precision and Quality Orientation: You find bugs in model reasoning that others miss. You are comfortable being the final quality arbiter for technical deliverables that others produce.
Benefits
- Health insurance
- 401(k) matching
- Flexible work hours
- Paid time off
- Remote work options
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, PyTorch, Transformers, LangChain, LlamaIndex, model evaluation suites, MMLU-style benchmarks, custom RAG metrics, human-preference alignment, RAG architectures
Soft Skills
technical feedback, mentorship, quality orientation, attention to detail, scientific consistency, collaboration, problem-solving, communication