Tech Stack
Cloud, Kubernetes, NumPy, pandas, Python, PyTorch, Rust, scikit-learn, SQL, TensorFlow
About the role
- Translate product problems into ML approaches, baseline against simple methods, and define success metrics and evaluation protocols.
- Build and maintain training datasets.
- Iterate on models, then graduate them from notebooks to production-quality services with proper testing, logging, and observability.
- Design, deploy, and maintain inference services and batch jobs. Handle scale, latency, and cost with appropriate architectures, queues, and caching.
- Monitor for data drift, model degradation, and operational issues. Set up dashboards and alerts, and run A/B or offline evaluations.
- Work across the stack as needed: schema design, APIs, CI/CD, containerization, and infrastructure-as-code in cloud environments.
- Contribute to agentic AI development: implement orchestration layers where AI agents can plan, act, and use tools, and integrate these agents with Case Status’ platform using the Model Context Protocol (MCP); a rough sketch of such a loop follows this list.
- Document decisions and results, and communicate trade-offs to technical and non-technical stakeholders.
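As a rough illustration of the agent orchestration point above, here is a minimal, hypothetical sketch of the kind of loop involved: the model proposes a tool call, the orchestrator executes it against a registry of allowed tools, and the result is fed back until the model returns a final answer. The helper names (`call_llm`, `lookup_case_status`, `TOOLS`) are illustrative stand-ins, not Case Status APIs, and the MCP wiring itself is omitted.

```python
# Hypothetical agent loop: plan -> act (call a registered tool) -> observe -> repeat.
import json
from typing import Callable

def lookup_case_status(case_id: str) -> str:
    """Hypothetical tool: look up the status of a case by ID."""
    return json.dumps({"case_id": case_id, "status": "in review"})

# Guardrail: only tools registered here can be executed by the agent.
TOOLS: dict[str, Callable[..., str]] = {"lookup_case_status": lookup_case_status}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for a real model call. Expected to return either a tool
    request, e.g. {"tool": "...", "args": {...}}, or {"answer": "..."}."""
    raise NotImplementedError

def run_agent(user_goal: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_goal}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:              # model is done: return the final answer
            return decision["answer"]
        tool = TOOLS.get(decision["tool"])    # unknown tools are refused, not executed
        if tool is None:
            messages.append({"role": "system", "content": "Unknown tool."})
            continue
        result = tool(**decision.get("args", {}))
        messages.append({"role": "tool", "content": result})
    return "Stopped: step budget exhausted."  # natural human-in-the-loop escalation point
```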
Requirements
- Strong software engineering fundamentals and production experience in Python.
- Hands-on experience training or fine-tuning models and shipping them to production.
- Practical knowledge of model evaluation and experiment design, including offline metrics and online testing.
- Experience building data pipelines for ML (feature extraction, labeling workflows, data versioning).
- Deep learning frameworks: PyTorch, TensorFlow, or comparable.
- Classical ML: scikit-learn or comparable.
- Serving and optimization: ONNX, TensorRT, or similar (see the sketch after this list).
- Data & tooling: SQL, pandas, and NumPy, plus experience with scalable storage and data pipelines in cloud environments.
- MLOps and platform: containers, Kubernetes, observability, CI/CD, and infrastructure-as-code to support training and serving.
- Experience in at least one of NLP, recommendation, time series, tabular modeling, computer vision, or LLM application patterns.
- Familiarity with MCP and agent frameworks, and an interest in building production systems where AI models act as semi-autonomous agents.
- Understanding of safe agent design: grounding, guardrails, and human-in-the-loop systems.
- Familiarity with Databricks or similar large-scale data platforms is a plus.
- Experience with high-performance inference (C++/Rust or GPU/TPU optimization) is a plus.
- Background in building ML platform tools adopted by other teams (feature stores, experiment tracking, model registries) is a plus.
- Exposure to agent frameworks or orchestration layers for AI systems is a plus.
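To ground the serving and optimization requirement above, here is a minimal sketch, assuming PyTorch, onnx, and onnxruntime are installed: export a toy classifier to ONNX with a dynamic batch dimension and run batched CPU inference with ONNX Runtime. The model, feature size, and file name are illustrative only.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

class TinyClassifier(nn.Module):
    """Toy model standing in for a real production classifier."""
    def __init__(self, in_features: int = 16, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32), nn.ReLU(), nn.Linear(32, n_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = TinyClassifier().eval()
dummy = torch.randn(1, 16)

# Export with a dynamic batch axis so the same graph serves any batch size.
torch.onnx.export(
    model, dummy, "tiny_classifier.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}, "logits": {0: "batch"}},
)

# Load the exported graph and run inference on a batch of 8 examples.
session = ort.InferenceSession("tiny_classifier.onnx", providers=["CPUExecutionProvider"])
batch = np.random.randn(8, 16).astype(np.float32)
(logits,) = session.run(["logits"], {"features": batch})
print(logits.shape)  # (8, 3)
```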