Design and iterate on prompts (system, tool/function-calling, and task prompts) to improve voice AI agent success rates, reliability, and tone.
Build co-pilots for customers to author their own prompts: meta-prompted assistants that suggest structures, lint for risks, autocomplete tool schemas, critique drafts, and generate eval cases.
Work directly with customer feedback and conversation logs to identify failure modes; translate them into prompt changes, guardrails, and data improvements.
Build eval datasets (success labels, rubrics, edge cases, regressions) and run offline/online evaluations (A/B tests, canaries) to quantify impact.
Create Python utilities/services for prompt versioning, config-as-code, rollout/rollback, and guardrails (policies, refusals, redaction).
Partner with PM/Success to define success metrics (task completion, first-pass accuracy, cost, latency) and instrument dashboards/alerts.
Own LLM integration details: function/tool schemas, output parsing/validation (Pydantic), retrieval-aware prompting, and fallback strategies (a minimal sketch follows this list).
Ensure privacy & compliance (PII handling, anonymization, regional data boundaries) in datasets and logs.
Share learnings via concise docs, playbooks, and internal demos.
Run a tight feedback loop with customers, turn real conversations into better prompts and eval datasets, and ship changes that measurably improve agent outcomes.
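To make the validation and fallback work above concrete, here is a minimal sketch assuming Pydantic v2; the schema, field names, and fallback policy are illustrative assumptions, not an existing implementation.

```python
# Illustrative sketch only: the schema, field names, and fallback policy are assumptions.
from pydantic import BaseModel, Field, ValidationError


class TransferCall(BaseModel):
    """Hypothetical tool-call payload a voice agent emits before transferring a caller."""
    department: str = Field(min_length=1)
    reason: str = Field(min_length=1)
    caller_verified: bool


def parse_tool_args(raw_json: str) -> TransferCall | None:
    """Validate the model's tool arguments; return None so the caller can fall back
    (e.g. re-prompt the model or escalate to a human) instead of acting on bad output."""
    try:
        return TransferCall.model_validate_json(raw_json)
    except ValidationError:
        return None
```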
Requirements
Python: 3+ years writing clean, tested, production code (typing, pytest, profiling); experience building small services/APIs (FastAPI preferred).
Prompt Engineering: Hands-on experience designing system/tool prompts, meta-prompting, rubric graders, and iterative prompt tuning based on real user data.
LLM Integration: Comfortable with major APIs (OpenAI/Anthropic/Google/Mistral), function/tool calling, streaming, and robust output handling (see the tool-calling sketch after this list).
Evaluation Mindset: Ability to define measurable success, create labeled datasets, and run methodical experiments/A/B tests.
Product Sense: Comfortable talking with customers and turning qualitative feedback into shipped improvements.
Data Hygiene: Practical experience cleaning, labeling, and balancing datasets; awareness of privacy/PII constraints.
Nice-to-haves: Experience building prompt-authoring UIs/SDKs or internal tooling for prompt versioning and governance.
Nice-to-haves: Agentic frameworks & tooling: DSPy, MCP, LangGraph, LlamaIndex, Rasa; experience with agent/tool schemas and orchestration.
Nice-to-haves: Observability & eval tooling: Langfuse, LangSmith, Braintrust; building eval harnesses and experiment dashboards.
Nice-to-haves: RAG & vector stores: Qdrant/Weaviate/Pinecone and retrieval-aware prompting.
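For context on the tool-calling experience referenced above, here is a minimal sketch assuming the OpenAI Python SDK (v1); the tool name, schema, model, and prompts are illustrative assumptions, not a spec for this role.

```python
# Illustrative sketch: the tool name, schema, model, and prompts are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # hypothetical tool exposed to the voice agent
        "description": "Fetch the status of a customer's order by order ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {"role": "system", "content": "You are a concise voice support agent."},
        {"role": "user", "content": "Where is my order 12345?"},
    ],
    tools=tools,
)

# If the model chose to call the tool, parse its arguments robustly before acting.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    args = json.loads(tool_calls[0].function.arguments)
    print(args.get("order_id"))
```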