DocHop: Benchmarking Out-of-domain Multi-hop Reasoning in Information-Dense Documents

Zhuoran Yu¹, Le Thien Phuc Nguyen¹, Jaden Park¹, Xinyi Gu², Zexue He³, Soochahn Lee⁴, Rogerio Feris⁵, Yong Jae Lee¹

¹University of Wisconsin-Madison   ²Massachusetts Institute of Technology   ³Stanford University   ⁴Kookmin University   ⁵MIT-IBM Watson AI Lab

Correspondence: zhuoran.yu@wisc.edu

Overview

DocHop is designed to evaluate whether multimodal models can perform integrated chart–document reasoning rather than treating charts and text as separate sources of information. Each instance is built around a symbolic reasoning trace that specifies how candidate entities should be filtered and combined through multi-step constraints. The document narrative verbalizes this reasoning specification, while the charts provide the corresponding numerical evidence. Questions refer to a semantic reference label introduced in the narrative instead of directly naming the target entities, forcing models to first resolve the relevant entities from context and then retrieve or aggregate evidence from the charts. In this way, DocHop isolates a controlled out-of-domain reasoning challenge: using document context to determine which chart evidence is relevant and how it should be reasoned over.
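To make the structure of an instance concrete, the following is a minimal sketch of how a symbolic reasoning trace might filter candidate entities and then aggregate chart evidence. It is not DocHop's actual data format or generation code: the chart values, the `Constraint` class, the `resolve_label` helper, and the thresholds are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical chart evidence: series name -> {entity -> value}.
# In a real instance these values would have to be read off the charts.
CHARTS = {
    "revenue": {"A": 120.0, "B": 80.0, "C": 150.0, "D": 95.0},
    "growth": {"A": 0.12, "B": 0.30, "C": 0.05, "D": 0.22},
}


@dataclass
class Constraint:
    """One step of an illustrative reasoning trace: keep entities whose
    value in the given chart series satisfies the predicate."""
    series: str
    predicate: Callable[[float], bool]


def resolve_label(candidates, constraints, charts):
    """Apply each constraint in order and return the surviving entities.

    This mirrors the first stage described above: working out which
    entities a semantic reference label in the narrative refers to.
    """
    selected = list(candidates)
    for c in constraints:
        selected = [e for e in selected if c.predicate(charts[c.series][e])]
    return selected


# Assumed two-step trace; the narrative would verbalize these constraints
# and attach a reference label to the resulting entity set.
TRACE = [
    Constraint("revenue", lambda v: v > 90.0),
    Constraint("growth", lambda v: v > 0.10),
]

resolved = resolve_label(["A", "B", "C", "D"], TRACE, CHARTS)

# Second stage: answer a question about the labeled set using chart values.
count_answer = len(resolved)                                 # counting-style question
total_revenue = sum(CHARTS["revenue"][e] for e in resolved)  # numeric-reasoning-style question
```

Under these assumed values, a question about the labeled set cannot be answered from the charts alone, because the charts never name which entities the label covers; the constraints from the narrative have to be applied first.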

Leaderboard

| Model | Type | Value Retrieval | Counting | Numeric Reasoning | Ranking | Hypothetical | Fact Checking | Overall |
|---|---|---|---|---|---|---|---|---|
| Human Eval | | 92.50 | 93.06 | 95.24 | 94.61 | 87.38 | 96.43 | 92.60 |
| GPT-5.2-Reasoning | Reasoning, Proprietary | 62.94 | 65.23 | 47.50 | 54.92 | 59.87 | 70.98 | 60.18 |
| Gemini-2.5-Pro-Reasoning | Reasoning, Proprietary | 50.16 | 53.31 | 32.50 | 46.67 | 41.42 | 41.96 | 44.24 |
| GPT-5.2 | Proprietary | 34.50 | 47.02 | 20.31 | 30.48 | 33.01 | 48.58 | 35.55 |
| Gemini-2.5-Flash-Reasoning | Reasoning, Proprietary | 40.58 | 48.26 | 18.75 | 37.14 | 30.74 | 48.26 | 36.67 |
| GPT-5-mini-Reasoning | Reasoning, Proprietary | 33.47 | 30.46 | 14.37 | 37.14 | 28.16 | 58.99 | 33.48 |
| Gemini-2.5-Flash | Proprietary | 32.91 | 31.46 | 14.69 | 32.70 | 29.45 | 40.06 | 30.17 |
| GPT-5-mini | Proprietary | 25.24 | 20.53 | 10.62 | 32.06 | 21.36 | 51.10 | 26.87 |
| Qwen-2.5-VL-7B | Open-Source | 18.53 | 21.85 | 6.25 | 25.08 | 17.15 | 41.64 | 21.75 |
| Claude-4.5-Sonnet | Proprietary | 18.53 | 29.14 | 4.37 | 13.65 | 15.53 | 38.17 | 19.83 |
| Qwen3-VL-8B | Open-Source | 15.65 | 16.89 | 7.81 | 15.24 | 18.77 | 47.95 | 20.42 |
| Molmo-7B-D-0934 | Open-Source | 15.65 | 22.52 | 5.63 | 12.70 | 12.94 | 44.16 | 18.92 |
| Molmo-7B-O-0934 | Open-Source | 9.90 | 24.50 | 2.50 | 14.29 | 15.53 | 46.06 | 18.76 |
| InternVL-3.5-8B | Open-Source | 7.03 | 11.92 | 3.44 | 11.11 | 13.59 | 49.21 | 16.10 |
| Ovis1.6-Gemma2-9B | Open-Source | 4.47 | 14.24 | 1.56 | 6.35 | 11.00 | 49.84 | 14.61 |
| LLaVA-Next-LLaMA3-8B | Open-Source | 0.00 | 8.94 | 0.31 | 0.00 | 11.00 | 46.37 | 11.14 |
| Claude-3.7-Sonnet | Proprietary | 3.83 | 3.64 | 3.12 | 12.06 | 10.36 | 28.39 | 10.29 |
| IDEFICS3-LLaMA3-8B | Open-Source | 4.79 | 15.89 | 1.25 | 0.32 | 7.44 | 21.14 | 8.42 |
| IDEFICS2-8B | Open-Source | 1.28 | 2.65 | 0.94 | 0.32 | 8.74 | 31.23 | 7.57 |

Contact

We welcome feedback, suggestions, and questions about DocHop. If you encounter any issues with the benchmark or are interested in collaboration, feel free to reach out at zhuoran.yu@wisc.edu.