Tech Stack
Airflow, AWS, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Kubernetes, Microservices, Python, PyTorch, Ray, Scikit-Learn, ServiceNow, TensorFlow, TypeScript
About the role
- Architect, design, and implement scalable AI platform services and components that enable intelligent agents and LLM-powered applications.
- Apply LLMs and other AI technologies directly to build and enhance Kandjiʼs intelligent features.
- Build LLM-powered agents that autonomously reason through complex workflows.
- Develop full-stack product surfaces (backend & frontend) that incorporate AI into user experiences.
- Create reusable infrastructure, frameworks, and services for experimentation, deployment, and monitoring of AI at scale.
- Implement scalable APIs and microservices that expose AI models and services to product teams.
- Integrate with multiple LLM providers and manage model selection, routing, and fallback strategies.
- Collaborate with Product Engineering, Data Science, frontend, and research teams to deliver secure, reliable AI experiences.
- Partner with core services infrastructure team to support execution, monitoring, logging, and automated evaluation of AI agents.
- Drive adoption of best practices in AI privacy, security, and compliance.
- Optimize platform performance, scalability, and cost-efficiency using cloud-native technologies and distributed systems.
Requirements
- Bachelorʼs or Masterʼs degree in Computer Science, Machine Learning, Data Science, or related field.
- Extensive experience designing and building scalable AI/ML platforms or infrastructure in a production environment.
- Proven track record of applying LLMs and AI models to real-world product features and user-facing solutions.
- Deep expertise in backend engineering, distributed systems, and cloud-native technologies (e.g., Kubernetes, Docker, AWS/GCP).
- Proficiency in agentic and orchestration frameworks (e.g., LangChain, CrewAI, Pydantic AI, LangGraph, Airflow, Kubeflow, Ray, or similar).
- Strong programming skills in Python, Go, TypeScript, or similar languages, and experience with ML frameworks (TensorFlow, PyTorch, scikit-learn).
- Experience with MLOps best practices, including model deployment, monitoring, logging, and automated evaluation.
- Demonstrated ability to address AI privacy and security challenges and to ensure compliance with data protection regulations.
- Familiarity with retrieval-augmented generation (RAG) and its integration into AI-driven applications.
- Excellent collaboration and communication skills, with cross-functional teamwork experience.
- Passion for staying at the forefront of AI infrastructure and applying new technologies to solve real-world problems at scale.