DocHop is designed to evaluate whether multimodal models can perform integrated chart–document reasoning rather than treating charts and text as separate sources of information. Each instance is built around a symbolic reasoning trace that specifies how candidate entities should be filtered and combined through multi-step constraints. The document narrative verbalizes this reasoning specification, while the charts provide the corresponding numerical evidence. Questions refer to a semantic reference label introduced in the narrative instead of directly naming the target entities, forcing models to first resolve the relevant entities from context and then retrieve or aggregate evidence from the charts. In this way, DocHop isolates a controlled out-of-domain reasoning challenge: using document context to determine which chart evidence is relevant and how it should be reasoned over.
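
The two-stage process described above (resolve the entities behind a reference label, then aggregate chart evidence for them) can be illustrated with a minimal sketch. This is not DocHop's actual data format or evaluation code: the `Instance` class, its field names, the helper functions, and every entity and number are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical structure; DocHop's real schema is not specified here.
@dataclass
class Instance:
    reference_label: str                        # label the question uses instead of entity names
    constraints: list[Callable[[dict], bool]]   # multi-step filters verbalized by the narrative
    narrative_facts: dict[str, dict]            # entity -> attributes stated in the document text
    chart_values: dict[str, float]              # entity -> numerical value shown in the chart


def resolve_entities(inst: Instance) -> list[str]:
    """Step 1: apply each narrative constraint in turn to narrow the candidates."""
    candidates = list(inst.narrative_facts)
    for constraint in inst.constraints:
        candidates = [e for e in candidates if constraint(inst.narrative_facts[e])]
    return candidates


def answer_sum(inst: Instance) -> float:
    """Step 2: aggregate chart evidence only for the resolved entities."""
    return sum(inst.chart_values[e] for e in resolve_entities(inst))


# Made-up example: the question would ask about "coastal growth markets"
# without ever naming entities A and D.
example = Instance(
    reference_label="coastal growth markets",
    constraints=[lambda f: f["coastal"], lambda f: f["growth"] > 0.05],
    narrative_facts={
        "A": {"coastal": True, "growth": 0.08},
        "B": {"coastal": True, "growth": 0.02},
        "C": {"coastal": False, "growth": 0.10},
        "D": {"coastal": True, "growth": 0.07},
    },
    chart_values={"A": 12.0, "B": 30.0, "C": 25.0, "D": 8.0},
)

print(resolve_entities(example))  # ['A', 'D']
print(answer_sum(example))        # 20.0
```

A model that skips the first step and reads the chart directly has no way of knowing that B and C should be excluded, which is the failure mode the benchmark is meant to expose.
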

| Model | Value Retrieval | Counting | Numeric Reasoning | Ranking | Hypothetical | Fact Checking | Overall |
|---|---|---|---|---|---|---|---|
| Human Eval | 92.50 | 93.06 | 95.24 | 94.61 | 87.38 | 96.43 | 92.60 |
| GPT-5.2-Reasoning (Reasoning, Proprietary) | 62.94 | 65.23 | 47.50 | 54.92 | 59.87 | 70.98 | 60.18 |
| Gemini-2.5-Pro-Reasoning (Reasoning, Proprietary) | 50.16 | 53.31 | 32.50 | 46.67 | 41.42 | 41.96 | 44.24 |
| GPT-5.2 (Proprietary) | 34.50 | 47.02 | 20.31 | 30.48 | 33.01 | 48.58 | 35.55 |
| Gemini-2.5-Flash-Reasoning (Reasoning, Proprietary) | 40.58 | 48.26 | 18.75 | 37.14 | 30.74 | 48.26 | 36.67 |
| GPT-5-mini-Reasoning (Reasoning, Proprietary) | 33.47 | 30.46 | 14.37 | 37.14 | 28.16 | 58.99 | 33.48 |
| Gemini-2.5-Flash (Proprietary) | 32.91 | 31.46 | 14.69 | 32.70 | 29.45 | 40.06 | 30.17 |
| GPT-5-mini (Proprietary) | 25.24 | 20.53 | 10.62 | 32.06 | 21.36 | 51.10 | 26.87 |
| Qwen-2.5-VL-7B (Open-Source) | 18.53 | 21.85 | 6.25 | 25.08 | 17.15 | 41.64 | 21.75 |
| Claude-4.5-Sonnet (Proprietary) | 18.53 | 29.14 | 4.37 | 13.65 | 15.53 | 38.17 | 19.83 |
| Qwen3-VL-8B (Open-Source) | 15.65 | 16.89 | 7.81 | 15.24 | 18.77 | 47.95 | 20.42 |
| Molmo-7B-D-0934 (Open-Source) | 15.65 | 22.52 | 5.63 | 12.70 | 12.94 | 44.16 | 18.92 |
| Molmo-7B-O-0934 (Open-Source) | 9.90 | 24.50 | 2.50 | 14.29 | 15.53 | 46.06 | 18.76 |
| InternVL-3.5-8B (Open-Source) | 7.03 | 11.92 | 3.44 | 11.11 | 13.59 | 49.21 | 16.10 |
| Ovis1.6-Gemma2-9B (Open-Source) | 4.47 | 14.24 | 1.56 | 6.35 | 11.00 | 49.84 | 14.61 |
| LLaVA-Next-LLaMA3-8B (Open-Source) | 0.00 | 8.94 | 0.31 | 0.00 | 11.00 | 46.37 | 11.14 |
| Claude-3.7-Sonnet (Proprietary) | 3.83 | 3.64 | 3.12 | 12.06 | 10.36 | 28.39 | 10.29 |
| IDEFICS3-LLaMA3-8B (Open-Source) | 4.79 | 15.89 | 1.25 | 0.32 | 7.44 | 21.14 | 8.42 |
| IDEFICS2-8B (Open-Source) | 1.28 | 2.65 | 0.94 | 0.32 | 8.74 | 31.23 | 7.57 |

We welcome feedback, suggestions, and questions about DocHop. If you encounter any issues with the benchmark or are interested in collaboration, feel free to reach out at zhuoran.yu@wisc.edu.