About
MedAgentsBench is a comprehensive benchmarking suite for evaluating thinking models and agent frameworks on complex medical reasoning tasks. It focuses on challenging medical questions where models achieve less than 50% accuracy, featuring 894 hard questions across 10 medical datasets including MedQA, PubMedQA, MedMCQA, and specialized expert-level questions.
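The "less than 50% accuracy" criterion can be illustrated with a minimal sketch. It assumes that difficulty is judged per question, as the fraction of evaluated models that answered it correctly; the function name `select_hard_questions`, the `results` layout, and the sample data are hypothetical and are not taken from the MedAgentsBench code.

```python
from typing import Dict, List


def select_hard_questions(
    results: Dict[str, List[bool]], threshold: float = 0.5
) -> List[str]:
    """Return IDs of questions whose accuracy across models is below threshold.

    `results` maps a question ID to one boolean per evaluated model
    (True = that model answered correctly). This layout is an assumption
    made for illustration only.
    """
    hard = []
    for qid, outcomes in results.items():
        if not outcomes:
            continue  # skip questions with no recorded model outcomes
        accuracy = sum(outcomes) / len(outcomes)
        if accuracy < threshold:
            hard.append(qid)
    return hard


# Made-up outcomes for three questions (not real benchmark data).
results = {
    "q1": [True, True, False, True],    # 75% of models correct
    "q2": [False, False, True, False],  # 25% of models correct
    "q3": [True, False],                # exactly 50% correct
}
hard_ids = select_hard_questions(results)
print(hard_ids)  # ['q2']
```

With a strict "less than" comparison, a question answered correctly by exactly half of the models is not counted as hard.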
Tech Stack
Python, Hugging Face Datasets, OpenAI API
Research Paper
View Paper
Quick Start
pip install -r requirements.txt