Lead Software Engineer – ML, Agentic Workloads

Mara

Full-time

Posted on: 12/17/2025

Location Type: Remote

Location: Remote • 🇺🇸 United States


Senior

Cloud, Grafana, Kubernetes, Prometheus, Python, PyTorch, Ray

About the role

Lead architecture and development of agentic platforms that integrate multiple models, tools, and knowledge sources into dynamic reasoning systems.
Evaluate and deploy foundation and open-source models (LLMs, vision, multimodal) using efficient inference strategies and fine-tuning where applicable.
Design and maintain prompt lifecycle pipelines with version control, testing, and CI/CD integration (“PromptOps”).
Build and optimize RAG systems, including vector database configuration, retriever-generator orchestration, and embedding quality improvement.
Implement guardrail frameworks for content safety, hallucination control, and policy enforcement across agentic workflows.
Integrate and extend agentic frameworks (LangChain, LangGraph, CrewAI, AutoGen, or equivalent) in both code-based and visual orchestration environments.
Collaborate with data, product, and infrastructure teams to design scalable APIs and services that enable model-driven applications.
Define observability and evaluation metrics for model performance, latency, and behavior drift in production.
Drive best practices for secure AI development, privacy-preserving data handling, and governance of third-party model integrations.
Mentor engineers across ML, backend, and platform domains; champion continuous learning and experimentation.

Requirements

8+ years of professional software engineering experience, including 3+ years in ML application development or AI platform engineering.
Proficiency in Python, with a strong understanding of ML toolchains (PyTorch, Hugging Face, LangChain, MLflow, Ray, etc.).
Proven experience with model evaluation, fine-tuning, and deployment across cloud and on-prem environments.
Hands-on experience with RAG architectures and vector databases (Weaviate, Milvus, pgvector, LanceDB, FAISS).
Deep understanding of prompt design, orchestration, and versioning using CI/CD workflows and automated testing frameworks.
Familiarity with agentic systems through both code-driven and visual-builder interfaces (LangGraph Studio, Dust, Flowise, Relevance AI, etc.).
Strong knowledge of guardrail techniques (rule-based filters, policy evaluators, toxicity detection, grounding validation).
Experience deploying ML systems on Kubernetes and serverless environments with observability (Prometheus, Grafana, OpenTelemetry).
Solid understanding of API design, microservice architecture, and data pipeline integration.
Excellent communication and leadership skills, with the ability to translate complex ML concepts into actionable engineering outcomes.

Benefits
