Sully.ai

Applied Research Scientist

full-time

Location Type: Remote

Location: United States

About the role

  • Build and scale automated evaluation pipelines (LLM-as-judge + human review) with clinical-grade benchmarks.
  • Audit existing evaluation approaches for clinical and agentic tasks.
  • Define initial benchmarks and build early automated pipelines.
  • Partner with engineering to land the first set of CI gates for accuracy, factuality, and safety.
  • Deliver a repeatable evaluation framework with automated pipelines in production.
  • Demonstrate measurable improvements in robustness, hallucination reduction, or safety.
  • Publish or present internal research findings that directly shape product reliability.

Requirements

  • Proven experience designing agentic processes and LLM evaluation/benchmarking frameworks.
  • Strong Python and ML background (PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex).
  • Demonstrated ability to design rigorous experiments and translate findings into production.
  • Track record of published research or deep applied work in LLMs and agent evaluation.
  • Strong communication and technical writing skills to articulate complex findings clearly.

Benefits

  • Speed matters - we operate with urgency, autonomy, and ownership.
  • You’ll work on real, first-of-their-kind problems at the edge of AI and medicine.
  • Your work helps doctors reclaim their time - and patients get better, faster care.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, ML, PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex, automated evaluation pipelines, clinical-grade benchmarks, rigorous experiments

Soft Skills
strong communication, technical writing, articulate complex findings