
Applied Research Scientist
Sully.ai
Full-time
Location Type: Remote
Location: United States
About the role
- Build and scale automated evaluation pipelines (LLM-as-judge + human review) with clinical-grade benchmarks.
- Audit existing evaluation approaches for clinical and agentic tasks.
- Define initial benchmarks and build early automated pipelines.
- Partner with engineering to land the first set of CI gates for accuracy, factuality, and safety.
- Deliver a repeatable evaluation framework with automated pipelines in production.
- Demonstrate measurable improvements in robustness, hallucination reduction, or safety.
- Publish or present internal research findings that directly shape product reliability.
Requirements
- Proven experience designing agentic processes and LLM evaluation/benchmarking frameworks.
- Strong Python and ML background (PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex).
- Demonstrated ability to design rigorous experiments and translate findings into production.
- Track record of published research or deep applied work in LLMs and agent evaluation.
- Strong communication and technical writing skills to articulate complex findings clearly.
Benefits
- Speed matters: we operate with urgency, autonomy, and ownership.
- You’ll work on real, first-of-their-kind problems at the edge of AI and medicine.
- Your work helps doctors reclaim their time and helps patients get better, faster care.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, ML, PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex, automated evaluation pipelines, clinical-grade benchmarks, rigorous experiments
Soft Skills
strong communication, technical writing, articulate complex findings