Tech Stack
AWS, Azure Cloud, Google Cloud Platform, gRPC, Python, React
About the role
- Design and develop robust, stateful, and scalable voice-first AI agents in Python, optimized for real-time voice interaction: managing turn-taking, interruptions, and low-latency responses (a minimal sketch of such a turn loop follows this list).
- Integrate best-in-class real-time Speech-to-Text (STT), Text-to-Speech (TTS), and Voice Activity Detection (VAD) services to create a seamless conversational flow.
- Connect voice agents with existing enterprise systems, databases, and third-party APIs to create powerful, end-to-end automated workflows initiated and managed through voice.
- Establish and own evals of voice agent performance and behavior, and iterate over time to systematically improve reliability and the overall user experience.
- Build end-to-end conversational flows with reasoning, planning, and dynamic tool use — beyond pre-scripted voice experiences.
- Work cross-functionally with product managers, ML scientists, and engineers to deeply understand user needs and voice interaction goals.
- Implement fallback, recovery, and error-handling strategies to deal with noisy audio input or speech recognition inaccuracies.
- Define and track voice-specific evaluation metrics (e.g., word error rate, latency, conversational naturalness).
- Develop observability tools and guardrails to monitor performance, ensure safety, and handle edge cases in spoken interactions.
- Document development, architecture decisions, and research findings to share knowledge across the team.
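To make the turn-taking and barge-in responsibilities above concrete, here is a minimal, self-contained Python sketch of the control flow. The transcribe, synthesize, and energy_vad helpers are hypothetical stand-ins for real streaming STT/TTS/VAD providers (no specific vendor API is implied); the point is only the loop structure: buffer audio until trailing silence closes the user's turn, then stream a reply while watching for interruptions.

```python
import time
from dataclasses import dataclass, field

# Hypothetical placeholders for real streaming STT/TTS providers.
# They are stubs so the control flow below is runnable on its own.
def transcribe(frames: list[bytes]) -> str:
    return "caller said something"            # stand-in for a streaming STT result

def synthesize(text: str) -> list[bytes]:
    return [text.encode()]                    # stand-in for streamed TTS audio chunks

def energy_vad(frame: bytes, threshold: int = 10) -> bool:
    """Toy energy-based VAD: treat longer ('louder') frames as speech."""
    return len(frame) > threshold

@dataclass
class TurnLoop:
    """Minimal barge-in-aware turn-taking loop for a voice agent."""
    silence_frames_to_end_turn: int = 3       # trailing silence that closes a user turn
    buffer: list[bytes] = field(default_factory=list)
    silent: int = 0

    def on_user_frame(self, frame: bytes) -> str | None:
        """Accumulate audio until VAD sees enough silence, then return a transcript."""
        if energy_vad(frame):
            self.buffer.append(frame)
            self.silent = 0
        else:
            self.silent += 1
        if self.buffer and self.silent >= self.silence_frames_to_end_turn:
            text = transcribe(self.buffer)
            self.buffer.clear()
            return text
        return None

    def speak(self, reply: str, incoming_frames) -> None:
        """Stream TTS chunks, but stop immediately if the caller barges in."""
        for chunk in synthesize(reply):
            frame = next(incoming_frames, b"")
            if energy_vad(frame):             # caller interrupted: abandon playback
                print("barge-in detected, yielding the floor")
                return
            print(f"playing {len(chunk)} bytes")
            time.sleep(0.01)                  # simulate real-time playback pacing

if __name__ == "__main__":
    loop = TurnLoop()
    # "Loud" frames exceed the toy threshold; empty frames count as silence.
    frames = [b"x" * 20] * 4 + [b""] * 3
    for f in frames:
        if (text := loop.on_user_frame(f)):
            loop.speak(f"You said: {text}", incoming_frames=iter([b""] * 10))
```

In a production agent the same structure would sit on top of streaming transports (e.g., WebSockets or gRPC) and real provider SDKs, with the VAD and endpointing thresholds tuned against latency and naturalness metrics.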
Requirements
- Strong experience building multi-step, tool-using agents (e.g., LangChain, AutoGen).
- Familiarity with prompt engineering, context management, and reasoning strategies such as Chain-of-Thought and ReAct.
- Experience building low-latency, streaming voice applications.
- Expertise in integrating and managing real-time STT/TTS models and APIs.
- Proficiency with Voice Activity Detection (VAD), noise suppression, and robust barge-in/interruption logic.
- Experience integrating third-party voice AI APIs, including STT and TTS services from providers such as OpenAI, Deepgram, and ElevenLabs.
- Understanding of latency, timing, and streaming audio constraints.
- Comfortable connecting agents to external APIs, tools, and databases in secure environments.
- Experience building RAG pipelines with vector stores, chunking strategies, and hybrid retrieval (a toy hybrid-retrieval sketch follows this list).
- Experience implementing and using monitoring tools and evaluation frameworks (e.g., Braintrust) to score AI agents.
- Familiarity with techniques for prompt injection defense, guardrails (e.g., Rebuff, Guardrails AI), and failover logic.
- Experience managing token budgets and latency through caching, model routing, and similar techniques.
- Expert in Python, FastAPI, and LLM SDKs.
- Experience deploying AI apps to cloud platforms (AWS, GCP, Azure) using CI/CD best practices.
- Nice-to-have: M.S. / Ph.D. in Computer Science, NLP, Machine Learning, or related field.
- Nice-to-have: Background in spoken dialogue systems or conversational UX design.
- Nice-to-have: Familiarity with real-time streaming architecture (e.g., WebRTC, gRPC, socket.io).
- Nice-to-have: Multilingual ASR/TTS pipeline experience.
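As a companion to the hybrid-retrieval requirement above, here is a toy, self-contained Python sketch that blends a dense (vector) score with a sparse (keyword-overlap) score over a tiny in-memory corpus. The embed bag-of-words function, the example documents, and the alpha weighting are placeholder assumptions; a real pipeline would use an embedding model, a vector store, a chunking step, and a sparse retriever such as BM25.

```python
import math
from collections import Counter

# Toy in-memory corpus; in practice these would be chunks stored in a vector DB.
DOCS = {
    "doc1": "reset your voicemail PIN from the account settings page",
    "doc2": "escalate billing disputes to a human agent after two failed attempts",
    "doc3": "the agent should confirm the caller's identity before sharing balances",
}

def embed(text: str) -> Counter:
    """Hypothetical embedding: a bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Crude lexical overlap, standing in for BM25 or another sparse retriever."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_retrieve(query: str, alpha: float = 0.5, k: int = 2) -> list[tuple[str, float]]:
    """Blend dense and sparse scores; alpha weights the dense (vector) side."""
    qv = embed(query)
    scored = [
        (doc_id, alpha * cosine(qv, embed(text)) + (1 - alpha) * keyword_score(query, text))
        for doc_id, text in DOCS.items()
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

if __name__ == "__main__":
    for doc_id, score in hybrid_retrieve("how do I reset my voicemail PIN"):
        print(f"{doc_id}: {score:.2f}")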