Senior Software Engineer – AI Evaluation & Benchmarks, Python

G2i Inc.

Remote Senior Software Engineer role contributing to AI model evaluation benchmarks and pipelines. Requires expert Python and extensive software engineering experience.

Posted 5/14/2026 • Contract • Remote • Florida • 🇺🇸 United States • Senior • 💰 $80 - $100 per hour

Tech Stack

Tools & technologies

Python

About the role

Key responsibilities & impact

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:
Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code
Build and maintain scalable data pipelines for evaluation workflows
Analyze model-generated code for correctness, reliability, and edge-case failures
Construct structured evaluation scenarios across large repos and multi-language environments
Provide detailed technical feedback on model performance and failure patterns
Contribute to evaluation frameworks that set the bar for how coding ability is measured
End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.
AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

Requirements

What you’ll need

4+ years of professional software engineering experience (non-negotiable)
Expert Python — clean, performant, well-tested code
Hands-on experience working in large, complex codebases
Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
Strong command of Git and modern development workflows
Track record at a high-growth tech company or top-tier software organization
Strong written English communication
Identity verification: Applicants must verify identity and have valid documentation to work as an independent contractor.

Benefits

Comp & perks

Identity verification is required for independent contractors in their country of residence
Weekly payments via PayPal or Stripe

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, coding benchmarks, evaluation pipelines, data pipelines, model performance analysis, debugging, production-quality code, multi-language environments, Git, LLM coding benchmarks

Soft Skills

written communication, technical feedback