Salary
💰 $90,000 - $210,000 per year
Tech Stack
Cloud, Kubernetes, Python
About the role
- Build the Central AI Platform: Design and build a unified, resilient platform for deploying and serving AI features, including a routing layer with provider fallbacks, circuit breakers, and cost/latency-aware model selection; a central registry for versioning models and prompts; and robust CI/CD pipelines (a routing sketch follows this list).
- Architect for Scale and Quality: Own the end-to-end Retrieval-Augmented Generation (RAG) strategy: lead the design of embedding pipelines, develop chunking strategies, implement hybrid search, manage index maintenance, and build and scale LLM evaluation tooling (golden sets, rubric-based scoring, LLM-as-judge with bias controls); a hybrid-search sketch also follows this list.
- Ensure Production Excellence: Instrument AI systems with deep observability (structured tracing, cost-attribution, latency metrics); define and uphold SLOs; create incident response runbooks; build safety guardrails for mission-critical AI services.
- Partner with Infrastructure: Own the LLM runtime, retrieval architecture (vector stores, indexing), evaluation frameworks, safety guardrails, prompt/model versioning, AI observability, and cost/latency optimization, while collaborating with Infrastructure on cloud, networking, secrets, Kubernetes/GPU orchestration, and shared platform services.
- Collaborate across teams: Act as a force multiplier for product and AI teams, helping them ship features faster, safer, and smarter; make strategic build-vs-buy decisions and influence rollout strategies, SLAs/SLOs, incident response, and capacity planning.
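To give a flavor of the routing work above, here is a minimal sketch of provider fallback with a failure-count circuit breaker and a naive cost/latency-aware ordering. The `Provider` profiles, the scoring weights, and the `call_fn` adapter are illustrative assumptions, not our production design.

```python
import time
from dataclasses import dataclass

@dataclass
class Provider:
    """A hypothetical provider endpoint with assumed cost/latency profiles."""
    name: str
    cost_per_1k_tokens: float  # USD (illustrative pricing)
    p50_latency_s: float       # observed median latency, seconds
    failures: int = 0
    opened_at: float = 0.0     # when the circuit last opened

    def circuit_open(self, cooldown_s: float = 30.0) -> bool:
        """Skip this provider while its circuit is open and cooling down."""
        return self.failures >= 3 and time.time() - self.opened_at < cooldown_s

def route(prompt: str, providers: list[Provider], call_fn) -> str:
    """Try providers in cost/latency order; fall back on failure.

    `call_fn(provider, prompt)` is an assumed adapter that invokes the
    provider's API and raises on any error.
    """
    # Naive cost/latency-aware ordering; a real policy would use live telemetry.
    ranked = sorted(
        (p for p in providers if not p.circuit_open()),
        key=lambda p: p.cost_per_1k_tokens + 0.01 * p.p50_latency_s,
    )
    last_err = None
    for provider in ranked:
        try:
            result = call_fn(provider, prompt)
            provider.failures = 0  # success closes the circuit
            return result
        except Exception as err:   # outage, rate limit, timeout, ...
            provider.failures += 1
            if provider.failures >= 3:
                provider.opened_at = time.time()  # open the circuit
            last_err = err
    raise RuntimeError("all providers exhausted") from last_err
```

In production the selection policy would be driven by live cost and latency telemetry rather than static profiles.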
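Similarly, a minimal sketch of the hybrid-search side of the RAG work: fixed-window chunking with overlap, and reciprocal rank fusion over lexical and vector rankings. The window sizes, the RRF constant, and the input rankings are illustrative assumptions.

```python
def chunk(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Fixed-window chunking with overlap — a common RAG baseline."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def reciprocal_rank_fusion(lexical: list[str], semantic: list[str],
                           k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs (e.g., BM25 and vector search).

    RRF rewards chunks that rank well in either list without requiring
    the two retrievers' score scales to be comparable.
    """
    scores: dict[str, float] = {}
    for ranking in (lexical, semantic):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: reciprocal_rank_fusion(["c3", "c1"], ["c1", "c7"]) puts "c1"
# first, since both retrievers ranked it.
```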
Requirements
- 5+ years of experience in software or machine learning engineering.
- At least 2 years in a role focused on building and operating production ML/LLM systems.
- Proven track record of shipping and scaling LLM-backed applications, with deep, hands-on expertise in the surrounding ecosystem.
- Expertise in modern LLM retrieval systems, including hands-on work with embedding pipelines, hybrid search, chunking strategies, and index maintenance.
- Demonstrated experience building robust LLM eval tooling (e.g., golden sets, rubric scoring, LLM-as-judge; see the evaluation sketch after this list).
- Practical knowledge of building resilient LLM routing and orchestration layers, incorporating provider fallbacks, circuit breakers, and cost/latency-aware selection.
- Strong programming skills in Python and a history of building production-grade automation and services.
- Strategic mindset, comfortable making build-vs-buy decisions and designing systems for long-term reliability and cost efficiency.
- Nice-to-have: Reproducible training & fine-tuning: containerized training jobs, experiment tracking (Weights & Biases, MLflow), dataset versioning, and standardized evaluation harnesses (lm-eval, HELM).
- Nice-to-have: ML Serving & Orchestration: Kubernetes-native serving (KServe, Seldon), model servers (Triton), and workflow orchestrators.
- Nice-to-have: Vector Databases experience (OpenSearch, pgvector, Pinecone, Weaviate) at scale.
- Nice-to-have: Experience designing and building multi-step, tool-using agents (e.g., LangGraph).
- Nice-to-have: Security & Safety experience: red-teaming exercises, adversarial tests, and implementing robust safety filters.
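As a sketch of the eval tooling mentioned above: rubric-based scoring of a golden set with an LLM judge, using position swapping as a simple bias control. `generate_fn` and `judge_fn` are hypothetical adapters, and the rubric and passing threshold are illustrative.

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One golden-set item: a fixed input and a reference answer."""
    prompt: str
    reference: str

RUBRIC = ("Score the CANDIDATE answer against the REFERENCE for factual "
          "accuracy and completeness, 1-5. Reply with only the integer.")

def pass_rate(golden_set: list[GoldenExample], generate_fn, judge_fn,
              threshold: float = 4.0) -> float:
    """Fraction of golden examples whose judged score meets the threshold.

    `generate_fn(prompt)` calls the system under test; `judge_fn(text)`
    calls the judge model and is assumed to return an integer as text.
    """
    passed = 0
    for ex in golden_set:
        candidate = generate_fn(ex.prompt)
        # Ask twice with the presentation order swapped and average the
        # scores — a simple control for the judge's position bias.
        s1 = int(judge_fn(f"{RUBRIC}\nCANDIDATE: {candidate}\n"
                          f"REFERENCE: {ex.reference}").strip())
        s2 = int(judge_fn(f"{RUBRIC}\nREFERENCE: {ex.reference}\n"
                          f"CANDIDATE: {candidate}").strip())
        if (s1 + s2) / 2 >= threshold:
            passed += 1
    return passed / len(golden_set)
```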