
Staff Developer, AI Evaluation, Reliability
Caseware
full-time
Location Type: Remote
Location: Colombia
About the role
- Own and evolve **evaluation strategy** for LLM- and agent-based systems, including golden datasets, rubric-based scoring, reference-free evaluations, regression testing, and A/B experimentation.
- Benchmark and analyze **foundation model performance** within Caseware’s domain, identifying capability gaps, failure modes, and opportunities for improvement.
- Lead the design and optimization of **Retrieval-Augmented Generation (RAG)** pipelines, including embeddings, retrieval strategies, reranking, and retrieval quality metrics.
- Design and maintain **feedback and evaluation pipelines** that connect real-world user behavior to measurable improvements in agent performance.
- Apply data science techniques to analyze agent behavior, diagnose reliability issues, detect drift, and surface systemic risks.
- Define and implement **guardrails** for agentic systems, including schema validation, content filtering, tool governance, and policy enforcement.
- Establish **approval gates, audit trails, and controlled rollout mechanisms** for AI and agent changes, including feature flags, staged deployments, and kill switches.
- Partner with Security and Data teams to embed **privacy-by-design** practices, including PII detection and masking, data minimization, and retention controls.
- Support and influence **SOC 2 and ISO 27001-aligned controls** across AI data flows, including access management, logging, and incident response.
- Act as a **Staff-level technical leader**, mentoring other engineers, shaping best practices, and raising the overall bar for AI reliability and evaluation across the organization.
Requirements
- Strong **data science foundation**, including Python, SQL, statistics, and experiment design.
- Deep hands-on experience with **LLMs**, prompting strategies, and agent reasoning patterns.
- Practical expertise with **embeddings, vector databases, retrieval metrics, and reranking approaches**.
- Proven experience designing or operating **evaluation frameworks for generative AI or agentic systems**, including automated and human-in-the-loop evaluation.
- Strong understanding of **AI reliability, safety, and governance**, including guardrails, validation, monitoring, and change control.
- Working knowledge of **privacy engineering principles** and familiarity with GDPR/CCPA concepts such as consent, purpose limitation, and data subject rights.
- Experience operating in **enterprise or regulated environments**, including contributions to SOC 2 / ISO 27001-aligned systems and processes.
- Ability to influence across teams, communicate clearly about complex AI trade-offs, and drive alignment without direct authority.
- **Strong English language communication and collaboration skills**.
Benefits
- Indefinite-term employment contract ("Contrato a término indefinido") with all the legal benefits
- Prepaid medical plan
- Life insurance and funeral assistance
- Internet allowance
- Home office stipend
- Competitive compensation — above the market average
- 100% remote work environment and an excellent work-life balance
- Opportunity to work for a growing global SaaS leader
- A culture that promotes independence, innovation, trust, and accountability
- Room to be creative, innovate, and strategize for the future
- Mentorship by highly experienced professionals
- Budget for training: we want you to grow
- 5 Personal Time Off days per year
- Sick leave top-up to 100% of salary, paid by the employer from day 3 to day 90
- Recognition award: additional paid time off in recognition of years of service
- Upgraded vacation allowance starting at 5 years of service
Applicant Tracking System Keywords
Hard Skills & Tools
Python, SQL, statistics, experiment design, LLMs, embeddings, vector databases, retrieval metrics, reranking, evaluation frameworks
Soft Skills
influence, communication, collaboration, mentoring, best practices, alignment
Certifications
SOC 2, ISO 27001