AI Engineer – LLM Ops & Evaluation

Auxilius.ai

AI Engineer owning the LLMOps pipeline and evaluation strategy for AI solutions in Governance, Risk and Compliance (GRC), collaborating with a startup team and shaping AI operations.

Posted 5/16/2026 • Full-time • Munich, 🇩🇪 Germany • Mid-Level, Senior

Tech Stack

Tools & technologies
Python

About the role

Key responsibilities & impact
  • Own the LLMOps pipeline: Evaluation infrastructure, the prompt optimization loop, and the production integration that turns experiments into reliable customer-facing features
  • Design evaluation strategy per output type: Decide when to use deterministic evals (exact match, schema validation, embeddings) vs. LLM-as-judge, and build the rubrics, test datasets, and human-review loops that make the system trustworthy
  • Drive prompt engineering and optimization across all LLM operations in the product: Moving from hand-tuned prompts to a measurable, iterative process
  • Pick the right tool for each problem: Some things are LLM problems, some are embedding + classical NLP problems, some are deterministic logic
  • Run the production side of AI features: Observability (Langfuse / LangSmith / similar), cost and latency engineering, incident response when an LLM feature degrades
  • Build human-in-the-loop workflows: Review queues, feedback ingestion, and labeling, so that production signals feed back into evals and prompt iteration
  • Mentor our AI & Analytics Intern and contribute to how we build the AI team over time
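
For readers unfamiliar with the deterministic evals named above (exact match, schema validation, embeddings), here is a minimal sketch in Python. It is illustrative only and not taken from Auxilius.ai: the function names, the example field names, and the use of plain lists as embedding vectors are all assumptions.

```python
import json
import math


def exact_match(output: str, expected: str) -> bool:
    """Exact match after normalizing case and whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.split()).lower()
    return normalize(output) == normalize(expected)


def schema_valid(output: str, required: dict) -> bool:
    """True if output parses as a JSON object with the required keys and types.

    `required` maps a key to its expected Python type (hypothetical schema format).
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two embedding vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Checks like these are cheap and repeatable, so they are usually a sensible first choice for structured outputs. Free-text outputs with no single correct answer are where LLM-as-judge and human review tend to come in.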

Requirements

What you’ll need
  • 3+ years of hands-on experience building and shipping ML/AI systems in production (we care more about what you've shipped than years on a CV)
  • You have shipped an LLM evaluation or prompt optimization pipeline: not just used LLMs in a project, but owned the loop
  • Strong hands-on experience with LLM-as-judge, including its variance problems and concrete techniques for controlling them
  • Solid foundation in classical NLP and ML ops: Embeddings, semantic similarity, entity matching, classification, fuzzy matching
  • Informed opinions on deterministic vs. LLM-based evals, from experience
  • Production judgment: You've owned cost and latency tradeoffs, observability, and incident response for an LLM-powered feature. You're familiar with prompt regression and have strategies for managing it
  • Strong Python
  • Excellent English communication, written and verbal: We discuss nuanced technical tradeoffs daily with the founding team and customers
  • Comfort with ambiguity: You can run experiments on real data, build intuition for this domain, and know when to stop iterating
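
As an illustration of the LLM-as-judge variance problem mentioned above, one common control is to sample the judge several times, take the majority label, and route low-agreement cases to human review. The sketch below assumes a hypothetical `judge` callable that returns a label string; the function name, sample count, and threshold are illustrative and not part of the posting.

```python
from collections import Counter
from typing import Callable, Tuple


def aggregate_judgments(
    judge: Callable[[str], str],
    answer: str,
    n: int = 5,
    min_agreement: float = 0.8,
) -> Tuple[str, float, bool]:
    """Call the judge n times and return (majority label, agreement, needs_review).

    `judge` is a hypothetical stand-in for an LLM call that grades `answer`.
    Agreement is the share of votes won by the majority label; anything
    below `min_agreement` is flagged for human review.
    """
    votes = Counter(judge(answer) for _ in range(n))
    label, count = votes.most_common(1)[0]
    agreement = count / n
    return label, agreement, agreement < min_agreement
```

Repeated sampling multiplies judge cost by `n`, so it ties directly into the cost and latency tradeoffs listed above.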

Benefits

Comp & perks
  • Hands-on ownership of a real AI product used by enterprise customers
  • Work directly alongside the founding team from day one
  • Hybrid work model: Munich North, minimum one day per week in the office, otherwise flexible (open to strong candidates elsewhere in the EU for the right fit); onboarding will take place in the office
  • A steep learning curve at the intersection of LLM engineering, enterprise GRC, and startup operations
  • The chance to shape the AI team as we grow
