
Lead Software Engineer II, AI Operations
Best Egg
full-time
Location Type: Remote
Location: United States
Salary
$150,000 - $170,000 per year
About the role
- Deliver internal copilots and customer/agent-facing automations with clear SLAs, rollbacks, and observability from day one.
- Design ingestion, chunking, embeddings, indexing, hybrid search/rerank, and retrieval evaluation; track retriever quality via offline golden sets and online metrics.
- Design and implement scalable AWS architectures, treating Bedrock and its knowledge bases, IAM, secure secrets and policy enforcement, automated provisioning, and resource-usage governance as core platform capabilities.
- Add tracing, prompt/agent version lineage, eval dashboards, and regression alerts; establish golden datasets and canary tests.
- Enforce PII redaction, safety filters, role-based access, audit logs, and human-in-the-loop review paths to control quality and risk.
- Version and deploy prompts, tools, agents, and retrieval pipelines; support blue/green and shadow deploys with automatic rollback triggers.
- Cut run-rate spend through caching, truncation, batching, autoscaling, and model routing; establish clear unit economics per workflow.
- Provide templates, SDKs, and high-quality abstractions that let product teams ship safely without bespoke plumbing; improve developer experience.
- Build primarily in Python and Metaflow (Outerbounds); deploy on AWS (Bedrock + core services) and OpenAI; use Cursor in daily workflows; help evaluate and, when appropriate, run on Databricks.
- Participate in on-call, author runbooks, and remove single-thread risk for AI services; drive reliability and resilience in the manner of MLOps.
Requirements
- 5–10 years of professional software engineering experience (or equivalent), including 2+ years building AI/LLM applications; a portfolio of shipped AI projects (links to code, demos, or case studies).
- Demonstrated passion for exploring the latest AI models, frameworks, and tooling, and for bringing state-of-the-art innovations into the team's workflow.
- Hands-on experience with some or all of OpenAI, Bedrock, and Hugging Face/Ollama/vLLM, as well as MCP servers, function/tool calling, multi-turn orchestration, streaming, and prompt/version management.
- Practical experience designing and tuning retrieval systems (chunking, embeddings, hybrid search, reranking), integrating with vector databases, and measuring retrieval quality.
- Comfortable building APIs/services and simple UIs where needed; strong fundamentals in Python and modern packaging/testing.
- CI/CD, containers, cloud fundamentals (AWS), and runtime performance tuning; experience operating services in production.
- Metaflow (Outerbounds) preferred; Databricks familiarity is a plus; ability to integrate data/feature pipelines and schedule/operate flows.
- Experience with tracing and logging; expertise in tools such as Datadog, Dynatrace, or Grafana for AI monitoring is essential.
- Comfortable optimizing latency/throughput/cost, and implementing guardrails for PII/safety/compliance.
- Partner effectively with data scientists, analysts, and engineers; promote best practices and high-leverage abstractions.
- Fine-tuning or distillation experience; Kubernetes or FastAPI exposure; familiarity with Snowflake or similar warehousing for retrieval sources.
Benefits
- Pre-tax and post-tax retirement savings plans with a competitive company matching program
- Generous paid time-off plans including vacation, personal/sick time, paid short-term and long-term disability leaves, paid parental leave, and paid company holidays
- Multiple health care plans to choose from, including dental and vision options
- Flexible Spending Plans for Health Care, Dependent Care, and Health Reimbursement Accounts
- Company-paid benefits such as life insurance, wellness platforms, employee assistance programs, and Health Advocate programs
- Other great discounted benefits include identity theft protection, pet insurance, fitness center reimbursements, and many more!
Applicant Tracking System Keywords
Hard Skills & Tools
Python, Metaflow, AWS, OpenAI, Huggingface, retrieval systems, CI/CD, containers, Kubernetes, FastAPI
Soft Skills
collaboration, problem-solving, communication, reliability, resilience, exploration, best practices, developer experience, partnering, optimizing