Applied Research Scientist

Sully.ai

full-time

Posted on: 2/26/2026

Location Type: Remote

About the role

Build and scale automated evaluation pipelines (LLM-as-judge + human review) with clinical-grade benchmarks.
Audit existing evaluation approaches for clinical and agentic tasks.
Define initial benchmarks and build early automated pipelines.
Partner with engineering to land the first set of CI gates for accuracy, factuality, and safety.
Deliver a repeatable evaluation framework with automated pipelines in production.
Demonstrate measurable improvements in robustness, hallucination reduction, or safety.
Publish or present internal research findings that directly shape product reliability.

Requirements

Proven experience designing agentic processes and LLM evaluation/benchmarking frameworks.
Strong Python and ML background (PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex).
Demonstrated ability to design rigorous experiments and translate findings into production.
Track record of published research or deep applied work in LLMs and agent evaluation.
Strong communication and technical writing skills to articulate complex findings clearly.

Benefits

Speed matters - we operate with urgency, autonomy, and ownership
You’ll work on real, first-of-their-kind problems at the edge of AI and medicine
Your work helps doctors reclaim their time - and patients get better, faster care

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, ML, PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex, automated evaluation pipelines, clinical-grade benchmarks, rigorous experiments

Soft Skills

strong communication, technical writing, articulate complex findings