G2i Inc.

Senior Software Engineer – AI Evaluation & Benchmarks, Python

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work.

Posted 5/14/2026 • Contract • Remote • Florida • United States • Senior • $80 - $100 per hour

Tech Stack

Tools & technologies
Python

About the role

Key responsibilities & impact
  • Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code
  • Build and maintain scalable data pipelines for evaluation workflows
  • Analyze model-generated code for correctness, reliability, and edge-case failures
  • Construct structured evaluation scenarios across large repos and multi-language environments
  • Provide detailed technical feedback on model performance and failure patterns
  • Contribute to evaluation frameworks that set the bar for how coding ability is measured
  • End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.
  • AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
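
The one-line workflow above (design task, build harness, run model, analyze failures) can be sketched in Python. This is a minimal illustration only, not G2i's actual tooling: the `Task` schema, the `grade` and `evaluate` helpers, the outcome labels, and the `stub_model` stand-in for a real model call are all hypothetical names introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Task:
    """One benchmark item (hypothetical schema): a prompt plus test cases."""
    task_id: str
    prompt: str
    entry_point: str  # name of the function the model is asked to write
    cases: tuple[tuple[tuple[Any, ...], Any], ...]  # ((args, expected), ...)


def grade(task: Task, source: str) -> str:
    """Run candidate source against the task's cases and return an outcome label."""
    namespace: dict[str, Any] = {}
    try:
        # NOTE: exec() is NOT a sandbox. A real harness would run untrusted
        # model output in an isolated process or container.
        exec(compile(source, task.task_id, "exec"), namespace)
    except SyntaxError:
        return "syntax_error"
    except Exception:
        return "runtime_error"
    func = namespace.get(task.entry_point)
    if not callable(func):
        return "missing_entry_point"
    for args, expected in task.cases:
        try:
            if func(*args) != expected:
                return "wrong_answer"
        except Exception:
            return "runtime_error"
    return "pass"


def evaluate(tasks: list[Task], model: Callable[[str], str]) -> dict[str, Any]:
    """Run the model on every task and summarize outcomes and failure patterns."""
    outcomes = {t.task_id: grade(t, model(t.prompt)) for t in tasks}
    counts = Counter(outcomes.values())
    return {
        "outcomes": outcomes,
        "pass_rate": counts["pass"] / len(tasks) if tasks else 0.0,
        "failures": {label: n for label, n in counts.items() if label != "pass"},
    }


TASKS = [
    Task("add", "Write add(a, b).", "add", (((1, 2), 3), ((-1, 1), 0))),
    Task("clamp", "Write clamp(x, lo, hi).", "clamp",
         (((5, 0, 3), 3), ((-2, 0, 3), 0))),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: returns canned code, one answer buggy."""
    canned = {
        "Write add(a, b).": "def add(a, b):\n    return a + b\n",
        # Deliberate edge-case bug: the lower bound is never applied.
        "Write clamp(x, lo, hi).": "def clamp(x, lo, hi):\n    return min(x, hi)\n",
    }
    return canned[prompt]


if __name__ == "__main__":
    print(evaluate(TASKS, stub_model))
```

The per-task outcome labels are what feed the "analyze failures" step: a benchmark whose failures cluster in one label (for example, edge-case `wrong_answer` results) points to where new tasks or test cases would be most discriminating.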

Requirements

What you’ll need
  • 4+ years of professional software engineering experience (non-negotiable)
  • Expert Python — clean, performant, well-tested code
  • Hands-on experience working in large, complex codebases
  • Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
  • Strong command of Git and modern development workflows
  • Track record at a high-growth tech company or top-tier software organization
  • Strong written English communication
  • Identity verification: Applicants must verify identity and have valid documentation to work as an independent contractor.

Benefits

Comp & perks
  • Identity verification is required for independent contractors in their country of residence
  • Weekly payments via PayPal or Stripe

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, coding benchmarks, evaluation pipelines, data pipelines, model performance analysis, debugging, production-quality code, multi-language environments, Git, LLM coding benchmarks
Soft Skills
written communication, technical feedback