MedAgents Benchmark

by Gerstein Lab

72 stars

8 forks

Python

MIT

About

MedAgentsBench is a benchmarking suite for evaluating thinking models and agent frameworks on complex medical reasoning tasks. It focuses on hard medical questions, meaning those on which models achieve less than 50% accuracy, and comprises 894 such questions across 10 medical datasets, including MedQA, PubMedQA, MedMCQA, and sets of specialized expert-level questions.
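The "less than 50% accuracy" criterion can be illustrated with a small sketch. This is not the benchmark's own selection code: the function name `select_hard_questions`, the `results` layout (question ID mapped to one correct/incorrect flag per model), and the sample data are all hypothetical, chosen only to show how such a filter might work.

```python
from typing import Dict, List


def select_hard_questions(
    results: Dict[str, List[bool]], threshold: float = 0.5
) -> List[str]:
    """Return IDs of questions whose accuracy across models is below `threshold`.

    `results` maps a question ID to per-model correctness flags.
    This layout is an assumption made for illustration only.
    """
    hard = []
    for qid, outcomes in results.items():
        if not outcomes:
            continue  # no model outcomes recorded, so accuracy is undefined
        accuracy = sum(outcomes) / len(outcomes)
        if accuracy < threshold:
            hard.append(qid)
    return hard


# Hypothetical outcomes for three questions answered by four models.
results = {
    "q1": [True, True, False, True],    # 75% correct
    "q2": [False, False, True, False],  # 25% correct
    "q3": [True, False, True, False],   # exactly 50% correct
}

print(select_hard_questions(results))  # ['q2']
```

The comparison is strict, so a question answered correctly by exactly half of the models is not counted as hard in this sketch.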

Tech Stack

Python, Hugging Face Datasets, OpenAI API

Quick Start

pip install -r requirements.txt
