About
MedAgentBench is a virtual EHR environment designed to benchmark medical LLM agents on clinical tasks. Built on top of AgentBench, it provides a Docker-based FHIR server that simulates realistic electronic health record interactions. The benchmark evaluates how well LLM agents navigate clinical workflows, make decisions based on patient data, and execute tasks in a standardized healthcare information system. Models such as GPT-4o, Gemini, and Claude are supported through configurable agent settings.
Tech Stack
Python, Docker, FHIR, OpenAI API, Vertex AI, AgentBench
Research Paper
View Paper
Quick Start
conda create -n medagentbench python=3.9
conda activate medagentbench
pip install -r requirements.txt
docker pull jyxsu6/medagentbench:latest
docker run -p 8080:8080 jyxsu6/medagentbench:latest
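Once the container is running, an agent interacts with the environment through FHIR search requests. The sketch below shows one way such a request might be built and its response read. It is illustrative only: the base URL `http://localhost:8080/fhir` is an assumption based on the port mapping above, and the helper names `build_search_url`, `resource_ids`, and `fetch_bundle` are hypothetical, not part of MedAgentBench.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the container serves its FHIR API under this base URL.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type, params, base=FHIR_BASE):
    """Build a FHIR search URL such as <base>/Patient?family=Smith."""
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{base.rstrip('/')}/{resource_type}?{query}"


def resource_ids(bundle):
    """Collect the ids of the resources in a FHIR search Bundle."""
    return [
        entry["resource"]["id"]
        for entry in bundle.get("entry", [])
        if "id" in entry.get("resource", {})
    ]


def fetch_bundle(url, timeout=10):
    """Send a GET request and parse the JSON Bundle (needs a running server)."""
    request = Request(url, headers={"Accept": "application/fhir+json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server up, `resource_ids(fetch_bundle(build_search_url("Patient", {"family": "Smith"})))` would list the ids of matching patients, provided the assumed base URL is correct for your deployment.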