Staff Developer, AI Evaluation, Reliability

Caseware

full-time

Posted on: 2/18/2026

Location Type: Remote

Location: Colombia

Job Level

Lead

Tech Stack

Python, SQL

About the role

Own and evolve **evaluation strategy** for LLM- and agent-based systems, including golden datasets, rubric-based scoring, reference-free evaluations, regression testing, and A/B experimentation.
Benchmark and analyze **foundation model performance** within Caseware’s domain, identifying capability gaps, failure modes, and opportunities for improvement.
Lead the design and optimization of **Retrieval-Augmented Generation (RAG)** pipelines, including embeddings, retrieval strategies, reranking, and retrieval quality metrics.
Design and maintain **feedback and evaluation pipelines** that connect real-world user behavior to measurable improvements in agent performance.
Apply data science techniques to analyze agent behavior, diagnose reliability issues, detect drift, and surface systemic risks.
Define and implement **guardrails** for agentic systems, including schema validation, content filtering, tool governance, and policy enforcement.
Establish **approval gates, audit trails, and controlled rollout mechanisms** for AI and agent changes, including feature flags, staged deployments, and kill switches.
Partner with Security and Data teams to embed **privacy-by-design** practices, including PII detection and masking, data minimization, and retention controls.
Support and influence **SOC 2 and ISO 27001-aligned controls** across AI data flows, including access management, logging, and incident response.
Act as a **Staff-level technical leader**, mentoring other engineers, shaping best practices, and raising the overall bar for AI reliability and evaluation across the organization.

Requirements

Strong **data science foundation**, including Python, SQL, statistics, and experiment design.
Deep hands-on experience with **LLMs**, prompting strategies, and agent reasoning patterns.
Practical expertise with **embeddings, vector databases, retrieval metrics, and reranking approaches**.
Proven experience designing or operating **evaluation frameworks for generative AI or agentic systems**, including automated and human-in-the-loop evaluation.
Strong understanding of **AI reliability, safety, and governance**, including guardrails, validation, monitoring, and change control.
Working knowledge of **privacy engineering principles** and familiarity with GDPR/CCPA concepts such as consent, purpose limitation, and data subject rights.
Experience operating in **enterprise or regulated environments**, including contributions to SOC 2 / ISO 27001-aligned systems and processes.
Ability to influence across teams, communicate clearly about complex AI trade-offs, and drive alignment without direct authority.
**Strong English language communication and collaboration skills.**

Benefits

Indefinite-term employment contract ("Contrato a término indefinido") with all legal benefits
Prepaid medical plan
Life insurance and funeral assistance
Internet allowance
Home office stipend
Competitive compensation, above the market average
100% remote work environment and an excellent work-life balance
Opportunity to work for a growing global SaaS leader
A culture that promotes independence, innovation, trust, and accountability
Room to be creative and innovative, and to strategize for the future
Mentorship by highly experienced professionals
Training budget: we want you to grow
5 Personal Time Off days per year
Sick leave top-up: the employer tops up sick pay to 100% of salary from day 3 to day 90
Recognition award: additional paid time off in recognition of years of service
Upgraded vacation allowance starting at 5 years of service

Applicant Tracking System Keywords

Hard Skills & Tools

Python, SQL, statistics, experiment design, LLMs, embeddings, vector databases, retrieval metrics, reranking, evaluation frameworks

Soft Skills

influence, communication, collaboration, mentoring, best practices, alignment

Certifications

SOC 2, ISO 27001