DocHop: Benchmarking Out-of-domain Multi-hop Reasoning in Information-Dense Documents

Overview

DocHop is designed to evaluate whether multimodal models can perform integrated chart–document reasoning rather than treating charts and text as separate sources of information. Each instance is built around a symbolic reasoning trace that specifies how candidate entities should be filtered and combined through multi-step constraints. The document narrative verbalizes this reasoning specification, while the charts provide the corresponding numerical evidence. Questions refer to a semantic reference label introduced in the narrative instead of directly naming the target entities, forcing models to first resolve the relevant entities from context and then retrieve or aggregate evidence from the charts. In this way, DocHop isolates a controlled out-of-domain reasoning challenge: using document context to determine which chart evidence is relevant and how it should be reasoned over.
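
The construction described above can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration: the chart values, the `Constraint` class, the `resolve_label` helper, and the "emerging leaders" label are not taken from the DocHop data or its release format. It shows only the general pattern: a symbolic trace filters candidate entities step by step, the narrative binds the surviving entities to a reference label, and the question is answered by aggregating chart values for that label.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical chart data: entity -> metric -> value (not taken from DocHop).
CHART = {
    "Region A": {"revenue": 120.0, "growth": 4.5},
    "Region B": {"revenue": 80.0, "growth": 9.1},
    "Region C": {"revenue": 150.0, "growth": 7.3},
    "Region D": {"revenue": 60.0, "growth": 8.0},
}


@dataclass
class Constraint:
    """One hop of the symbolic trace: keep entities whose metric passes a test."""
    metric: str
    test: Callable[[float], bool]


def resolve_label(chart, trace):
    """Apply each constraint in order to narrow the candidate entities."""
    candidates = list(chart)
    for step in trace:
        candidates = [e for e in candidates if step.test(chart[e][step.metric])]
    return candidates


# The narrative would verbalize this trace and bind it to a reference label,
# e.g. "emerging leaders": regions with growth above 7 and revenue above 70.
trace = [
    Constraint("growth", lambda v: v > 7.0),
    Constraint("revenue", lambda v: v > 70.0),
]
emerging_leaders = resolve_label(CHART, trace)

# Question: "What is the total revenue of the emerging leaders?"
# The label must be resolved first; only then can chart values be aggregated.
total_revenue = sum(CHART[e]["revenue"] for e in emerging_leaders)
print(emerging_leaders, total_revenue)
```

In this toy setup the question never names Region B or Region C directly, so a model that reads the chart without resolving the label from the narrative has no way to know which bars to sum.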

Leaderboard

Model	Value Retrieval	Counting	Numeric Reasoning	Ranking	Hypothetical	Fact Checking	Overall
Human Eval	92.50	93.06	95.24	94.61	87.38	96.43	92.60
GPT-5.2-Reasoning (Reasoning, Proprietary)	62.94	65.23	47.50	54.92	59.87	70.98	60.18
Gemini-2.5-Pro-Reasoning (Reasoning, Proprietary)	50.16	53.31	32.50	46.67	41.42	41.96	44.24
GPT-5.2 (Proprietary)	34.50	47.02	20.31	30.48	33.01	48.58	35.55
Gemini-2.5-Flash-Reasoning (Reasoning, Proprietary)	40.58	48.26	18.75	37.14	30.74	48.26	36.67
GPT-5-mini-Reasoning (Reasoning, Proprietary)	33.47	30.46	14.37	37.14	28.16	58.99	33.48
Gemini-2.5-Flash (Proprietary)	32.91	31.46	14.69	32.70	29.45	40.06	30.17
GPT-5-mini (Proprietary)	25.24	20.53	10.62	32.06	21.36	51.10	26.87
Qwen-2.5-VL-7B (Open-Source)	18.53	21.85	6.25	25.08	17.15	41.64	21.75
Claude-4.5-Sonnet (Proprietary)	18.53	29.14	4.37	13.65	15.53	38.17	19.83
Qwen3-VL-8B (Open-Source)	15.65	16.89	7.81	15.24	18.77	47.95	20.42
Molmo-7B-D-0934 (Open-Source)	15.65	22.52	5.63	12.70	12.94	44.16	18.92
Molmo-7B-O-0934 (Open-Source)	9.90	24.50	2.50	14.29	15.53	46.06	18.76
InternVL-3.5-8B (Open-Source)	7.03	11.92	3.44	11.11	13.59	49.21	16.10
Ovis1.6-Gemma2-9B (Open-Source)	4.47	14.24	1.56	6.35	11.00	49.84	14.61
LLaVA-Next-LLaMA3-8B (Open-Source)	0.00	8.94	0.31	0.00	11.00	46.37	11.14
Claude-3.7-Sonnet (Proprietary)	3.83	3.64	3.12	12.06	10.36	28.39	10.29
IDEFICS3-LLaMA3-8B (Open-Source)	4.79	15.89	1.25	0.32	7.44	21.14	8.42
IDEFICS2-8B (Open-Source)	1.28	2.65	0.94	0.32	8.74	31.23	7.57
