Caseware

Staff Developer, AI Evaluation, Reliability

full-time

Location Type: Remote

Location: Colombia

About the role

  • Own and evolve **evaluation strategy** for LLM- and agent-based systems, including golden datasets, rubric-based scoring, reference-free evaluations, regression testing, and A/B experimentation.
  • Benchmark and analyze **foundation model performance** within Caseware’s domain, identifying capability gaps, failure modes, and opportunities for improvement.
  • Lead the design and optimization of **Retrieval-Augmented Generation (RAG)** pipelines, including embeddings, retrieval strategies, reranking, and retrieval quality metrics.
  • Design and maintain **feedback and evaluation pipelines** that connect real-world user behavior to measurable improvements in agent performance.
  • Apply data science techniques to analyze agent behavior, diagnose reliability issues, detect drift, and surface systemic risks.
  • Define and implement **guardrails** for agentic systems, including schema validation, content filtering, tool governance, and policy enforcement.
  • Establish **approval gates, audit trails, and controlled rollout mechanisms** for AI and agent changes, including feature flags, staged deployments, and kill switches.
  • Partner with Security and Data teams to embed **privacy-by-design** practices, including PII detection and masking, data minimization, and retention controls.
  • Support and influence **SOC 2 and ISO 27001-aligned controls** across AI data flows, including access management, logging, and incident response.
  • Act as a **Staff-level technical leader**, mentoring other engineers, shaping best practices, and raising the overall bar for AI reliability and evaluation across the organization.

Requirements

  • Strong **data science foundation**, including Python, SQL, statistics, and experiment design.
  • Deep hands-on experience with **LLMs**, prompting strategies, and agent reasoning patterns.
  • Practical expertise with **embeddings, vector databases, retrieval metrics, and reranking approaches**.
  • Proven experience designing or operating **evaluation frameworks for generative AI or agentic systems**, including automated and human-in-the-loop evaluation.
  • Strong understanding of **AI reliability, safety, and governance**, including guardrails, validation, monitoring, and change control.
  • Working knowledge of **privacy engineering principles** and familiarity with GDPR/CCPA concepts such as consent, purpose limitation, and data subject rights.
  • Experience operating in **enterprise or regulated environments**, including contributions to SOC 2 / ISO 27001-aligned systems and processes.
  • Ability to influence across teams, communicate clearly about complex AI trade-offs, and drive alignment without direct authority.
  • **Strong English language communication and collaboration skills**

Benefits

  • Indefinite-term employment contract ("contrato a término indefinido") with all legal benefits
  • Prepaid medical plan
  • Life insurance and funeral assistance
  • Internet allowance
  • Home office stipend
  • Competitive compensation, above the market average
  • 100% remote work environment and an excellent work-life balance
  • Opportunity to work for a growing global SaaS leader
  • A culture that promotes independence, innovation, trust, and accountability
  • Room to be creative and innovative, and to strategize for the future
  • Mentorship from highly experienced professionals
  • Training budget: we want you to grow
  • 5 personal time off days per year
  • Sick leave top-up to 100% of salary, paid by the employer from day 3 through day 90
  • Recognition award: additional paid time off tied to your years of service
  • Upgraded vacation allowance starting at 5 years of service

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, SQL, statistics, experiment design, LLMs, embeddings, vector databases, retrieval metrics, reranking, evaluation frameworks
Soft Skills
influence, communication, collaboration, mentoring, best practices, alignment
Certifications
SOC 2, ISO 27001