Tech Stack
Tools & technologies
- Python
About the role
Key responsibilities & impact
Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:
- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and writing production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured
- End result: benchmarks that meaningfully separate what frontier models can and cannot do, and that shape how the next generation is trained and improved.
- AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
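As a rough illustration of that loop, here is a minimal Python sketch. Every name in it (Task, Result, run_task, pass_rate, and the two stand-in "models") is hypothetical and chosen for illustration only; none of it comes from this employer's tooling or from any real evaluation framework. It assumes each task asks the model to define a single function that can be checked against input/expected-output cases.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """One benchmark task (hypothetical shape): a prompt plus checkable cases."""
    task_id: str
    prompt: str
    entry_point: str                   # name of the function the model must define
    cases: list[tuple[tuple, object]]  # (positional args, expected result)


@dataclass
class Result:
    """Outcome of running one task against one model."""
    task_id: str
    passed: bool
    failures: list[str] = field(default_factory=list)


def run_task(task: Task, model: Callable[[str], str]) -> Result:
    """Ask the model for source code, load it, and check every case."""
    source = model(task.prompt)
    namespace: dict = {}
    try:
        # Assumption: exec() on model output is tolerable only in a toy sketch.
        # A real harness would run untrusted code in an isolated sandbox.
        exec(source, namespace)
        func = namespace[task.entry_point]
    except Exception as exc:
        return Result(task.task_id, False, [f"load error: {exc!r}"])

    failures = []
    for args, expected in task.cases:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{args}: raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"{args}: expected {expected!r}, got {got!r}")
    return Result(task.task_id, not failures, failures)


def pass_rate(results: list[Result]) -> float:
    """Fraction of tasks with no failing case (0.0 for an empty run)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Stand-in "models": plain functions that return source code for a prompt.
def strong_model(prompt: str) -> str:
    return "def add(a, b):\n    return a + b\n"


def weak_model(prompt: str) -> str:
    return "def add(a, b):\n    return a - b\n"


TASKS = [
    Task("add-1", "Write add(a, b) returning the sum.", "add",
         [((1, 2), 3), ((0, 0), 0)]),
]

if __name__ == "__main__":
    for name, model in [("strong", strong_model), ("weak", weak_model)]:
        results = [run_task(t, model) for t in TASKS]
        print(name, pass_rate(results), [r.failures for r in results])
```

Note that the weak stand-in still passes the (0, 0) case, so a benchmark containing only that case would not separate the two models. Failure messages like these are what would feed back into adding harder or more discriminating cases.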
Requirements
What you’ll need
- 4+ years of professional software engineering experience (non-negotiable)
- Expert Python — clean, performant, well-tested code
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
- Strong command of Git and modern development workflows
- Track record at a high-growth tech company or top-tier software organization
- Strong written English communication
- Identity verification: Applicants must verify identity and have valid documentation to work as an independent contractor.
Benefits
Comp & perks
- Identity verification is required for independent contractors in their country of residence
- Weekly payments via PayPal or Stripe
