Sully.ai

Applied Research Scientist

full-time

Origin: 🇺🇸 United States

Job Level

Mid-Level · Senior

Tech Stack

Python · PyTorch · TensorFlow

About the role

  • Design and run experiments to measure accuracy, robustness, and hallucination rates in LLM agents
  • Build automated evaluation pipelines (LLM-as-judge + human review) with clinical-grade benchmarks
  • Partner with Research Ops/IRB to design efficacy studies and align with regulatory requirements
  • Translate research into production-ready evaluation systems, collaborating with engineering to land features 0→1
  • Develop error taxonomies, ablations, and guardrails to ensure safe and reliable agent behaviors
  • Audit existing evaluation approaches for clinical and agentic tasks (first-month focus)
  • Define initial benchmarks and build early automated pipelines (first-month focus)
  • Partner with engineering to land CI gates for accuracy, factuality, and safety (first-month focus)
  • Deliver a repeatable evaluation framework with automated pipelines in production (90-day OKR)
  • Demonstrate measurable improvements in robustness, hallucination reduction, or safety (90-day OKR)
  • Publish or present internal research findings that directly shape product reliability (90-day OKR)

Requirements

  • Proven experience designing agentic processes and LLM evaluation/benchmarking frameworks
  • Strong Python and ML background (PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex)
  • Demonstrated ability to design rigorous experiments and translate findings into production
  • Track record of published research or deep applied work in LLMs and agent evaluation
  • Strong communication and technical writing skills
  • Prior work in healthcare/clinical NLP with awareness of medical data standards (nice-to-have)
  • Experience running IRB-aligned or clinical-grade studies (nice-to-have)
  • Exposure to noisy/limited medical data and designing strategies to overcome constraints (nice-to-have)

Please note that we are unable to sponsor new visas at this time.

Proximity Works

Senior Data Scientist – LLMs, RAG, Multimodal AI

Senior · full-time · 🇮🇳 India
Posted: 10 days ago · Source: apply.workable.com
Python · PyTorch · TensorFlow

LatentView Analytics

AI/ML Engineer

Junior · Mid · full-time · District of Columbia · 🇺🇸 United States
Posted: 26 days ago · Source: ats.rippling.com
Python · PyTorch

Autodesk

AI Research Scientist, Multimodal

Mid · Senior · full-time · 🇬🇧 United Kingdom
Posted: 1 day ago · Source: autodesk.wd1.myworkdayjobs.com
Python · PyTorch

hims & hers

Sr. Staff Machine Learning Engineer

Senior · full-time · $240k–$260k / year · 🇺🇸 United States
Posted: 37 days ago · Source: jobs.ashbyhq.com
AWS · Python · PyTorch · TensorFlow

hims & hers

Staff Machine Learning Engineer

Lead · full-time · $210k–$230k / year · 🇺🇸 United States
Posted: 37 days ago · Source: jobs.ashbyhq.com
AWS · Python · PyTorch