Research Engineer – Evaluations, Applied AI

LILT

contract

Posted on: 3/3/2026

Location Type: Remote

Location: Argentina

About the role

Eval Architecture & Benchmarking: Design and implement automated and human-in-the-loop evaluation frameworks to measure model performance across multiple modalities (text, code, image, etc.).
Calibration & Peer Review: Act as the Gold Standard reviewer for other engineers. You will calibrate their data generation and evaluation contributions, providing technical feedback to ensure scientific consistency and high-fidelity output.
Frontier Sample Generation: Write and refine complex prompts and golden response pairs for frontier-model training, specifically focusing on edge cases in reasoning and multilingual contexts.
Quality Control (End-to-End): Develop the logic for multimodal QC checks, ensuring that high-volume data samples are correct across diverse domains and languages.
Technical Mentorship: Bring new knowledge and best practices on model evaluations to our established delivery and forward-deployed engineering teams.

Education: B.S. in Computer Science, AI, or a related field, or 5+ years of relevant experience in a high-growth AI/Research environment.
Deep Technical Proficiency: Expert-level Python skills and hands-on experience with modern AI frameworks (PyTorch, Transformers, LangChain/LlamaIndex).
Evaluation Experience: Experience building model evaluation suites (e.g., MMLU-style benchmarks, custom RAG metrics, or human-preference alignment).
Domain Expertise: Deep understanding of RAG architectures, vector database retrieval logic, and agentic workflows. Experience with RLHF/RLAIF environments and the mechanics of preference signaling/reward modeling.
Multimodal & Multilingual Rigor: Experience handling data quality at scale across different languages and modalities (images, video, or audio).
Precision and Quality Orientation: You find bugs in model reasoning that others miss. You are comfortable being the final quality arbiter for technical deliverables that others produce.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, PyTorch, Transformers, LangChain, LlamaIndex, model evaluation suites, MMLU-style benchmarks, custom RAG metrics, human-preference alignment, RAG architectures

Soft Skills

technical feedback, mentorship, quality orientation, attention to detail, scientific consistency, collaboration, problem-solving, communication