
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

MBZUAI, University of California Merced, Google Research,
Australian National University, Linköping University

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities, visuals, audio, and text, across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and spread across different modalities and moments in the video.

🔥 Highlights

  1. Multimodal Reasoning Benchmark: VideoMathQA introduces a challenging “needle-in-a-multimodal-haystack” setup where models must reason across visuals, text and audio. Key information is sparsely distributed across modalities and time, requiring strong performance in fine-grained visual understanding, multimodal integration, and reasoning.

  2. Three Types of Reasoning: Questions are categorized into: Problem Focused, where the question is explicitly stated and solvable via direct observation and reasoning from the video; Concept Transfer, where a demonstrated method or principle is adapted to a newly posed problem; and Deep Instructional Comprehension, which requires understanding long-form instructional content, interpreting partially worked-out steps, and completing the solution.

  3. Diverse Evaluation Dimensions: Each question is evaluated across four axes: (i) mathematical concepts, spanning 10 domains such as geometry, statistics, arithmetic, and charts; (ii) video duration, ranging from 10 seconds to 1 hour and categorized as short, medium, or long; (iii) difficulty level; and (iv) reasoning type. This structure captures diversity in content, length, complexity, and reasoning depth.

  4. High-Quality Human Annotations: The benchmark includes 420 expert-curated questions, each with five answer choices, a correct answer, and detailed chain-of-thought (CoT) steps. Over 2,945 reasoning steps have been manually written, reflecting 920+ hours of expert annotation effort with rigorous quality control.
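To make this annotation format concrete, the sketch below shows how one such entry might be represented in code. The field names (video_id, cot_steps, duration_category, and so on) are hypothetical and do not reflect the official release schema.

```python
# A minimal, hypothetical sketch of how one VideoMathQA entry could be represented.
# Field names are illustrative only and are not the official release schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoMathQAEntry:
    video_id: str                 # source educational video
    question: str                 # question text
    choices: List[str]            # five answer options
    answer: str                   # correct option letter, e.g. "C"
    cot_steps: List[str] = field(default_factory=list)  # expert-written CoT steps
    concept: str = ""             # one of the 10 mathematical concepts
    duration_category: str = ""   # "short", "medium", or "long"
    difficulty: str = ""          # difficulty level
    reasoning_type: str = ""      # "problem_focused", "concept_transfer", or "deep_comprehension"

example = VideoMathQAEntry(
    video_id="lecture_0001",
    question="Using the method shown in the video, what is the area of the shaded region?",
    choices=["12", "16", "20", "24", "28"],
    answer="C",
    cot_steps=[
        "Identify the triangle drawn on the whiteboard.",
        "Read the base and height stated by the instructor.",
        "Apply area = 1/2 * base * height.",
    ],
    concept="geometry",
    duration_category="medium",
    difficulty="easy",
    reasoning_type="concept_transfer",
)
```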
Highlight Figure

The foundation of our benchmark is the “needle-in-a-multimodal-haystack” challenge, capturing the core difficulty of cross-modal reasoning across time from visual, textual, and audio streams. Built on this, VideoMathQA categorizes each question along four key dimensions: reasoning type, mathematical concept, video duration, and difficulty.

Benchmark Examples

Examples are provided for each of the three reasoning types: Problem Focused, Concept Transfer, and Deep Instructional Comprehension.

🏅 VideoMathQA Leaderboard - Chain-of-Thought Reasoning 🧠

Accuracy on the VideoMathQA benchmark using Chain-of-Thought (CoT) reasoning for the MCQ and Multi-Binary (MBin) tasks, with and without subtitles; V denotes evaluation on the video alone and +Sub denotes evaluation with subtitles. The table reports performance across mathematical concepts and video durations, along with the CoT step score. This leaderboard is sorted by results on Multi-Binary with subtitles. An illustrative scoring sketch is provided after the table.

| Models | Size | MCQ (V) | MCQ (+Sub) | MBin (V) | MBin (+Sub) | GAng | GAre | GLen | Chart | Stat | Arth | Topo | Grph | Cntg | Pzle | Short | Med | Long | CoT Step Score |
|--------|------|---------|------------|----------|-------------|------|------|------|-------|------|------|------|------|------|------|-------|-----|------|----------------|
| Video-R1 | 7B | 23.8 | 27.6 | 18.1 | 20.0 | 13.0 | 26.8 | 23.5 | 9.3 | 13.0 | 34.6 | 20.0 | 16.7 | 18.4 | 16.7 | 21.6 | 26.0 | 11.4 | 3.9 |
| LLaVA-Video | 7B | 26.4 | 23.6 | 20.0 | 16.0 | 4.4 | 15.5 | 23.5 | 16.0 | 21.7 | 7.7 | 26.7 | 0.0 | 21.1 | 18.5 | 16.4 | 16.9 | 14.4 | 2.7 |
| Qwen2.5-VL | 7B | 25.2 | 29.5 | 17.6 | 18.3 | 13.0 | 15.5 | 11.8 | 20.0 | 21.7 | 36.5 | 13.3 | 16.7 | 10.5 | 16.7 | 16.4 | 20.1 | 18.2 | 3.7 |
| InternVL3 | 8B | 28.8 | 26.9 | 17.9 | 20.0 | 17.4 | 22.5 | 27.5 | 13.3 | 4.4 | 17.3 | 13.3 | 16.7 | 7.9 | 24.1 | 19.4 | 23.4 | 9.9 | 3.4 |
| LLaVA-Video | 72B | 23.6 | 29.3 | 14.8 | 18.6 | 8.7 | 22.5 | 17.7 | 14.7 | 8.7 | 21.2 | 26.7 | 11.1 | 26.3 | 20.4 | 17.2 | 21.4 | 16.7 | 3.1 |
| LLaVA-OV | 72B | 23.3 | 26.9 | 14.3 | 18.1 | 8.7 | 14.1 | 19.6 | 13.3 | 21.7 | 26.9 | 20.0 | 22.2 | 10.5 | 25.9 | 15.7 | 23.4 | 14.4 | 3.2 |
| Qwen2.5-VL | 72B | 37.4 | 36.9 | 24.5 | 28.6 | 30.4 | 31.0 | 31.4 | 24.0 | 21.7 | 50.0 | 13.3 | 22.2 | 15.8 | 25.9 | 27.6 | 34.4 | 22.7 | 5.0 |
| InternVL3 | 78B | 34.1 | 37.1 | 25.2 | 27.9 | 39.1 | 39.4 | 33.3 | 13.3 | 26.1 | 23.1 | 33.3 | 22.2 | 10.5 | 40.7 | 28.4 | 36.4 | 17.4 | 4.9 |
| Claude-3.7-sonnet | - | 24.8 | 29.5 | 12.1 | 19.3 | 34.8 | 29.6 | 19.6 | 4.0 | 26.1 | 13.5 | 20.0 | 16.7 | 21.1 | 22.2 | 23.1 | 26.0 | 7.6 | 4.2 |
| GPT-4o | - | 27.1 | 34.3 | 18.6 | 22.9 | 26.1 | 22.5 | 17.7 | 17.3 | 30.4 | 32.7 | 20.0 | 33.3 | 13.2 | 25.9 | 19.4 | 29.9 | 18.2 | 4.9 |
| Gemini-2.0-Flash | - | 35.2 | 38.8 | 19.5 | 24.8 | 34.8 | 21.1 | 27.5 | 18.7 | 21.7 | 28.9 | 13.3 | 33.3 | 18.4 | 33.3 | 27.6 | 27.9 | 18.2 | 4.7 |
| GPT-o4-mini | - | 49.8 | 61.4 | 42.1 | 44.8 | 43.5 | 49.3 | 45.1 | 40.0 | 65.2 | 63.5 | 20.0 | 72.2 | 23.7 | 31.5 | 45.5 | 44.8 | 42.4 | 6.9 |
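As a rough illustration of how the MCQ and Multi-Binary (MBin) accuracies above could be computed, the sketch below scores MCQ by exact option match and treats each Multi-Binary item as a set of yes/no judgments that must all be correct. This is an assumed protocol for illustration, not the official evaluation code.

```python
# Hypothetical scoring sketch for the two task formats in the leaderboard.
# Assumes MCQ predictions are option letters and each MBin prediction is a list
# of yes/no judgments, one per binary statement derived from the original question.
from typing import List

def mcq_accuracy(preds: List[str], answers: List[str]) -> float:
    """Percentage of questions where the predicted option letter matches the key."""
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in zip(preds, answers))
    return 100.0 * correct / len(answers)

def mbin_accuracy(preds: List[List[bool]], answers: List[List[bool]]) -> float:
    """A question counts as correct only if every binary judgment is correct
    (a stricter reformulation of MCQ; assumed here, not confirmed by the source)."""
    correct = sum(p == a for p, a in zip(preds, answers))
    return 100.0 * correct / len(answers)

# Toy usage:
print(mcq_accuracy(["C", "A", "B"], ["C", "D", "B"]))   # ~66.7
print(mbin_accuracy([[True, False], [True, True]],
                    [[True, False], [False, True]]))     # 50.0
```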

🏅 VideoMathQA Leaderboard - Direct Answering 🎯

Accuracy on the VideoMathQA benchmark using the direct answer format (no CoT reasoning) for the MCQ and Multi-Binary (MBin) tasks, with and without subtitles; V denotes evaluation on the video alone and +Sub denotes evaluation with subtitles. The table reports performance across mathematical concepts and video durations. This leaderboard is sorted by results on Multi-Binary with subtitles.

| Models | Size | MCQ (V) | MCQ (+Sub) | MBin (V) | MBin (+Sub) | GAng | GAre | GLen | Chart | Stat | Arth | Topo | Grph | Cntg | Pzle | Short | Med | Long |
|--------|------|---------|------------|----------|-------------|------|------|------|-------|------|------|------|------|------|------|-------|-----|------|
| Claude-3.7-sonnet | - | 26.2 | 27.1 | 8.6 | 9.5 | 17.4 | 9.9 | 5.9 | 8.0 | 17.4 | 11.5 | 13.3 | 5.6 | 5.3 | 9.3 | 8.2 | 11.0 | 9.1 |
| GPT-4o | - | 20.2 | 24.5 | 12.6 | 13.6 | 13.0 | 12.7 | 15.7 | 12.0 | 4.4 | 17.3 | 20.0 | 5.6 | 7.9 | 20.4 | 14.2 | 15.6 | 10.6 |
| Gemini-2.0-Flash | - | 28.6 | 31.7 | 14.1 | 20.5 | 30.4 | 23.9 | 27.5 | 13.3 | 8.7 | 19.2 | 13.3 | 16.7 | 7.9 | 33.3 | 25.4 | 24.0 | 11.4 |
| Gemini-1.5-Flash | - | 20.5 | 23.1 | 12.6 | 17.6 | 26.1 | 15.5 | 19.6 | 9.3 | 17.4 | 23.1 | 6.7 | 22.2 | 15.8 | 24.1 | 17.9 | 22.1 | 12.1 |
| Qwen2.5-VL | 3B | 26.9 | 27.6 | 19.3 | 19.6 | 26.1 | 23.9 | 23.5 | 21.3 | 34.8 | 17.3 | 26.7 | 11.1 | 15.8 | 20.4 | 25.4 | 23.4 | 15.9 |
| InternVL2.5 | 2B | 24.3 | 20.7 | 14.3 | 14.5 | 21.7 | 9.9 | 27.5 | 10.7 | 4.4 | 15.4 | 20.0 | 0.0 | 15.8 | 16.7 | 17.9 | 16.9 | 8.3 |
| PLM-LLaMA | 3B | 22.9 | 22.1 | 13.6 | 15.0 | 17.4 | 16.9 | 25.5 | 8.0 | 26.1 | 9.6 | 20.0 | 11.1 | 13.2 | 13.0 | 16.4 | 18.8 | 9.1 |
| InternVL3 | 2B | 22.4 | 23.3 | 18.8 | 16.4 | 21.7 | 16.9 | 17.7 | 17.3 | 30.4 | 15.4 | 20.0 | 22.2 | 13.2 | 5.6 | 18.7 | 14.9 | 15.9 |
| PLM-LLaMA | 8B | 22.1 | 23.1 | 16.7 | 14.5 | 13.0 | 11.3 | 17.7 | 13.3 | 17.4 | 17.3 | 20.0 | 11.1 | 10.5 | 16.7 | 16.4 | 14.9 | 12.1 |
| Oryx-1.5 | 7B | 22.6 | 22.6 | 16.9 | 17.4 | 13.0 | 23.9 | 23.5 | 9.3 | 21.7 | 23.1 | 20.0 | 5.6 | 18.4 | 11.1 | 20.2 | 20.8 | 10.6 |
| LLaVA-OV | 7B | 20.7 | 21.2 | 14.8 | 15.5 | 8.7 | 15.5 | 17.7 | 16.0 | 30.4 | 17.3 | 13.3 | 5.6 | 15.8 | 11.1 | 16.4 | 18.8 | 10.6 |
| LongVA-DPO | 7B | 21.4 | 21.7 | 16.2 | 14.1 | 8.7 | 15.5 | 17.7 | 12.0 | 30.4 | 9.6 | 6.7 | 5.6 | 10.5 | 18.5 | 14.9 | 11.7 | 15.9 |
| Video-R1 | 7B | 21.4 | 17.4 | 16.0 | 16.2 | 8.7 | 22.5 | 25.5 | 16.0 | 26.1 | 13.5 | 6.7 | 5.6 | 13.2 | 9.3 | 16.4 | 16.9 | 15.2 |
| InternVL2.5 | 8B | 24.3 | 24.8 | 18.6 | 18.6 | 26.1 | 19.7 | 17.7 | 17.3 | 21.7 | 19.2 | 26.7 | 11.1 | 10.5 | 20.4 | 17.9 | 22.7 | 14.4 |
| LLaVA-Video | 7B | 26.9 | 26.4 | 20.0 | 19.3 | 13.0 | 21.1 | 31.4 | 17.3 | 17.4 | 15.4 | 26.7 | 5.6 | 18.4 | 18.5 | 23.9 | 20.8 | 12.9 |
| InternVideo2.5 | 8B | 25.2 | 28.6 | 19.1 | 19.1 | 34.8 | 22.5 | 15.7 | 14.7 | 21.7 | 19.2 | 20.0 | 27.8 | 10.5 | 18.5 | 18.7 | 22.1 | 15.9 |
| Qwen2.5-VL | 7B | 26.7 | 27.9 | 19.8 | 19.1 | 8.7 | 25.4 | 25.5 | 18.7 | 13.0 | 23.1 | 13.3 | 5.6 | 15.8 | 16.7 | 22.4 | 19.5 | 15.2 |
| InternVL3 | 8B | 29.1 | 27.9 | 20.0 | 20.7 | 13.0 | 29.6 | 27.5 | 13.3 | 13.0 | 28.9 | 20.0 | 22.2 | 15.8 | 14.8 | 25.4 | 24.0 | 12.1 |
| VideoChat-R1 | 7B | 27.6 | 29.1 | 21.2 | 21.2 | 8.7 | 22.5 | 31.4 | 21.3 | 17.4 | 30.8 | 6.7 | 11.1 | 15.8 | 18.5 | 26.9 | 20.1 | 16.7 |
| Aria | 34B | 23.8 | 26.4 | 17.4 | 19.1 | 8.7 | 25.4 | 19.6 | 22.7 | 17.4 | 19.2 | 20.0 | 11.1 | 21.1 | 11.1 | 21.6 | 16.9 | 18.9 |
| Oryx-1.5 | 32B | 30.5 | 33.1 | 22.9 | 24.1 | 30.4 | 39.4 | 31.4 | 10.7 | 17.4 | 21.2 | 6.7 | 11.1 | 15.8 | 33.3 | 27.6 | 29.9 | 13.6 |
| Qwen2.5-VL | 32B | 32.4 | 32.6 | 25.7 | 24.8 | 43.5 | 31.0 | 25.5 | 14.7 | 26.1 | 26.9 | 6.7 | 27.8 | 10.5 | 33.3 | 28.4 | 30.5 | 14.4 |
| InternVL2.5 | 38B | 31.0 | 33.6 | 24.1 | 26.0 | 43.5 | 38.0 | 39.2 | 8.0 | 13.0 | 32.7 | 6.7 | 11.1 | 18.4 | 29.6 | 34.3 | 31.8 | 10.6 |
| InternVL3 | 38B | 31.7 | 35.7 | 25.2 | 29.5 | 34.8 | 42.3 | 37.3 | 13.3 | 17.4 | 25.0 | 13.3 | 33.3 | 26.3 | 40.7 | 35.8 | 38.3 | 12.9 |
| LLaVA-Video | 72B | 28.3 | 30.0 | 20.2 | 24.3 | 8.7 | 32.4 | 25.5 | 20.0 | 13.0 | 36.5 | 13.3 | 22.2 | 21.1 | 24.1 | 27.6 | 27.3 | 17.4 |
| LLaVA-OV | 72B | 25.5 | 28.3 | 21.0 | 24.8 | 17.4 | 31.0 | 23.5 | 12.0 | 21.7 | 38.5 | 20.0 | 27.8 | 18.4 | 31.5 | 30.6 | 28.6 | 14.4 |
| InternVL2.5 | 78B | 33.3 | 31.7 | 28.3 | 27.9 | 39.1 | 36.6 | 31.4 | 18.7 | 26.1 | 32.7 | 26.7 | 27.8 | 13.2 | 27.8 | 33.6 | 35.1 | 13.6 |
| Qwen2.5-VL | 72B | 36.9 | 37.6 | 26.0 | 27.9 | 26.1 | 36.6 | 31.4 | 17.3 | 30.4 | 38.5 | 20.0 | 16.7 | 18.4 | 29.6 | 34.3 | 29.2 | 19.7 |
| InternVL3 | 78B | 33.3 | 31.7 | 28.3 | 27.9 | 39.1 | 36.6 | 31.4 | 18.7 | 26.1 | 32.7 | 26.7 | 27.8 | 13.2 | 27.8 | 33.6 | 35.1 | 13.6 |

Overview and Analysis of VideoMathQA

Examples from the Benchmark

Figure 1

Example questions from VideoMathQA illustrating the three reasoning types: Problem Focused, Concept Transfer, and Deep Comprehension. Each example includes evolving dynamics in the video, a complex text prompt, five multiple-choice options, expert-annotated step-by-step reasoning for solving the given problem, and the final correct answer.

Overview of VideoMathQA

Figure 1

The figure illustrates: a) the distribution of questions and model performance across the ten mathematical concepts in VideoMathQA, where the consistently low performance across all concepts reveals a significant gap in the ability of current multimodal models to perform mathematical reasoning over videos; b) the distribution of video durations in VideoMathQA, highlighting a diverse range from short clips of 10 seconds to long videos of up to 1 hour; c) the three-stage annotation pipeline for VideoMathQA, performed by expert science graduates who annotated detailed step-by-step reasoning trails, with each stage governed by strict quality assessment.

Effect of Video Length, Subtitles, and Frame Count on Multimodal Reasoning

Figure 2

The figure illustrates VideoMathQA performance (a) across video duration categories, (b) with and without subtitles, and (c) with varying numbers of input frames. Models perform best on medium-length videos, and accuracy improves with the inclusion of subtitles and with more frames during evaluation.
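For context on the frame-count ablation in (c), the snippet below shows a generic way to sample a fixed budget of frames uniformly from a video with OpenCV. It is a sketch under common assumptions, not the benchmark's official preprocessing.

```python
# Generic uniform frame sampling, as commonly used when varying the frame budget.
# Not the benchmark's official preprocessing; shown only to make the ablation concrete.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32):
    """Return up to `num_frames` frames sampled uniformly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# e.g. compare a model fed 8, 16, or 32 frames per video:
# for budget in (8, 16, 32):
#     frames = sample_frames("lecture_0001.mp4", budget)
```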

Understanding Model Limitations in VideoMathQA Reasoning

Figure 3

The figure shows a) a comparison among vision-blind, image-only, and video models, highlighting the need for video-level understanding to perform well on VideoMathQA; b) the distribution of questions in VideoMathQA across three difficulty levels for varying reasoning depths, and the relationship between performance and question difficulty for the top-performing models; c) an error analysis based on CoT step evaluation. Most model errors stem from misunderstanding the question, where models misinterpret what the question asks or overlook critical multimodal cues.
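As a loose sketch of what step-level CoT evaluation could look like, the function below checks each expert-annotated reference step against a model's reasoning with a pluggable judge and scales the coverage to a 0-10 score, mirroring the CoT step score column in the leaderboard. The protocol, the judge, and the scaling are assumptions for illustration, not the benchmark's actual metric.

```python
# Hypothetical step-level CoT evaluation sketch. The benchmark's real metric may
# differ; this only illustrates the idea of grading reasoning step by step.
from typing import Callable, List

def cot_step_score(model_reasoning: str,
                   reference_steps: List[str],
                   judge: Callable[[str, str], bool]) -> float:
    """`judge(step, reasoning)` decides whether one reference step is covered."""
    covered = sum(judge(step, model_reasoning) for step in reference_steps)
    return 10.0 * covered / len(reference_steps)

# A trivial stand-in judge based on keyword overlap; a real setup would more
# plausibly use an LLM or human grader.
def naive_judge(step: str, reasoning: str) -> bool:
    keywords = [w for w in step.lower().split() if len(w) > 4]
    return bool(keywords) and any(w in reasoning.lower() for w in keywords)
```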
