
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

MBZUAI, University of California Merced, Google Research,
Australian National University, Linköping University

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities, visuals, audio, and text, across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and spread across different modalities and moments in the video.

🔥 Highlights

  1. Multimodal Reasoning Benchmark: VideoMathQA introduces a challenging “needle-in-a-multimodal-haystack” setup where models must reason across visuals, text and audio. Key information is sparsely distributed across modalities and time, requiring strong performance in fine-grained visual understanding, multimodal integration, and reasoning.

  2. Three Types of Reasoning: Questions are categorized into: Problem Focused, where the question is explicitly stated and solvable via direct observation and reasoning from the video; Concept Transfer, where a demonstrated method or principle is adapted to a newly posed problem; and Deep Instructional Comprehension, which requires understanding long-form instructional content, interpreting partially worked-out steps, and completing the solution.

  3. Diverse Evaluation Dimensions: Each question is evaluated across four axes: (i) mathematical concepts, spanning 10 domains such as geometry, statistics, arithmetic, and charts; (ii) video duration, ranging from 10 seconds to 1 hour and categorized as short, medium, or long; (iii) difficulty level; and (iv) reasoning type. This structure captures diversity in content, length, complexity, and reasoning depth.

  4. High-Quality Human Annotations: The benchmark includes 420 expert-curated questions, each with five answer choices, a correct answer, and detailed chain-of-thought (CoT) steps. Over 2,945 reasoning steps have been manually written, reflecting 920+ hours of expert annotation effort with rigorous quality control.
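To make this annotation format concrete, the sketch below shows how one such entry might be represented in code. The field names (video_id, cot_steps, duration_category, and so on) are hypothetical and do not reflect the official release schema.

```python
# A minimal, hypothetical sketch of how one VideoMathQA entry could be represented.
# Field names are illustrative only and are not the official release schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoMathQAEntry:
    video_id: str                 # source educational video
    question: str                 # question text
    choices: List[str]            # five answer options
    answer: str                   # correct option letter, e.g. "C"
    cot_steps: List[str] = field(default_factory=list)  # expert-written CoT steps
    concept: str = ""             # one of the 10 mathematical concepts
    duration_category: str = ""   # "short", "medium", or "long"
    difficulty: str = ""          # difficulty level
    reasoning_type: str = ""      # "problem_focused", "concept_transfer", or "deep_comprehension"

example = VideoMathQAEntry(
    video_id="lecture_0001",
    question="Using the method shown in the video, what is the area of the shaded region?",
    choices=["12", "16", "20", "24", "28"],
    answer="C",
    cot_steps=[
        "Identify the triangle drawn on the whiteboard.",
        "Read the base and height stated by the instructor.",
        "Apply area = 1/2 * base * height.",
    ],
    concept="geometry",
    duration_category="medium",
    difficulty="easy",
    reasoning_type="concept_transfer",
)
```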
Highlight Figure

The foundation of our benchmark is the “needle-in-a-multimodal-haystack” challenge, capturing the core difficulty of cross-modal reasoning across time from visual, textual, and audio streams. Built on this, VideoMathQA categorizes each question along four key dimensions: reasoning type, mathematical concept, video duration, and difficulty.

Benchmark Examples

Examples are provided for each of the three reasoning types: Problem Focused, Concept Transfer, and Deep Instructional Comprehension.

🏅 VideoMathQA Leaderboard - Chain-of-Thought Reasoning 🧠

Accuracy on the VideoMathQA benchmark using Chain-of-Thought (CoT) reasoning for the MCQ and Multi-Binary (MBin) tasks, with and without subtitles; V denotes evaluation on the video alone and +Sub denotes evaluation with subtitles. The table reports performance across mathematical concepts and video durations, along with the CoT step score. This leaderboard is sorted by results on Multi-Binary with subtitles. An illustrative scoring sketch is provided after the table.

| Models | Size | MCQ (V) | MCQ (+Sub) | MBin (V) | MBin (+Sub) | GAng | GAre | GLen | Chart | Stat | Arth | Topo | Grph | Cntg | Pzle | Short | Med | Long | CoT Step Score |
|--------|------|---------|------------|----------|-------------|------|------|------|-------|------|------|------|------|------|------|-------|-----|------|----------------|
| Video-R1 | 7B | 23.8 | 27.6 | 18.1 | 20.0 | 13.0 | 26.8 | 23.5 | 9.3 | 13.0 | 34.6 | 20.0 | 16.7 | 18.4 | 16.7 | 21.6 | 26.0 | 11.4 | 3.9 |
| LLaVA-Video | 7B | 26.4 | 23.6 | 20.0 | 16.0 | 4.4 | 15.5 | 23.5 | 16.0 | 21.7 | 7.7 | 26.7 | 0.0 | 21.1 | 18.5 | 16.4 | 16.9 | 14.4 | 2.7 |
| Qwen2.5-VL | 7B | 25.2 | 29.5 | 17.6 | 18.3 | 13.0 | 15.5 | 11.8 | 20.0 | 21.7 | 36.5 | 13.3 | 16.7 | 10.5 | 16.7 | 16.4 | 20.1 | 18.2 | 3.7 |
| InternVL3 | 8B | 28.8 | 26.9 | 17.9 | 20.0 | 17.4 | 22.5 | 27.5 | 13.3 | 4.4 | 17.3 | 13.3 | 16.7 | 7.9 | 24.1 | 19.4 | 23.4 | 9.9 | 3.4 |
| LLaVA-Video | 72B | 23.6 | 29.3 | 14.8 | 18.6 | 8.7 | 22.5 | 17.7 | 14.7 | 8.7 | 21.2 | 26.7 | 11.1 | 26.3 | 20.4 | 17.2 | 21.4 | 16.7 | 3.1 |
| LLaVA-OV | 72B | 23.3 | 26.9 | 14.3 | 18.1 | 8.7 | 14.1 | 19.6 | 13.3 | 21.7 | 26.9 | 20.0 | 22.2 | 10.5 | 25.9 | 15.7 | 23.4 | 14.4 | 3.2 |
| Qwen2.5-VL | 72B | 37.4 | 36.9 | 24.5 | 28.6 | 30.4 | 31.0 | 31.4 | 24.0 | 21.7 | 50.0 | 13.3 | 22.2 | 15.8 | 25.9 | 27.6 | 34.4 | 22.7 | 5.0 |
| InternVL3 | 78B | 34.1 | 37.1 | 25.2 | 27.9 | 39.1 | 39.4 | 33.3 | 13.3 | 26.1 | 23.1 | 33.3 | 22.2 | 10.5 | 40.7 | 28.4 | 36.4 | 17.4 | 4.9 |
| Claude-3.7-sonnet | - | 24.8 | 29.5 | 12.1 | 19.3 | 34.8 | 29.6 | 19.6 | 4.0 | 26.1 | 13.5 | 20.0 | 16.7 | 21.1 | 22.2 | 23.1 | 26.0 | 7.6 | 4.2 |
| GPT-4o | - | 27.1 | 34.3 | 18.6 | 22.9 | 26.1 | 22.5 | 17.7 | 17.3 | 30.4 | 32.7 | 20.0 | 33.3 | 13.2 | 25.9 | 19.4 | 29.9 | 18.2 | 4.9 |
| Gemini-2.0-Flash | - | 35.2 | 38.8 | 19.5 | 24.8 | 34.8 | 21.1 | 27.5 | 18.7 | 21.7 | 28.9 | 13.3 | 33.3 | 18.4 | 33.3 | 27.6 | 27.9 | 18.2 | 4.7 |
| GPT-o4-mini | - | 49.8 | 61.4 | 42.1 | 44.8 | 43.5 | 49.3 | 45.1 | 40.0 | 65.2 | 63.5 | 20.0 | 72.2 | 23.7 | 31.5 | 45.5 | 44.8 | 42.4 | 6.9 |
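As a rough illustration of how the MCQ and Multi-Binary (MBin) accuracies above could be computed, the sketch below scores MCQ by exact option match and treats each Multi-Binary item as a set of yes/no judgments that must all be correct. This is an assumed protocol for illustration, not the official evaluation code.

```python
# Hypothetical scoring sketch for the two task formats in the leaderboard.
# Assumes MCQ predictions are option letters and each MBin prediction is a list
# of yes/no judgments, one per binary statement derived from the original question.
from typing import List

def mcq_accuracy(preds: List[str], answers: List[str]) -> float:
    """Percentage of questions where the predicted option letter matches the key."""
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in zip(preds, answers))
    return 100.0 * correct / len(answers)

def mbin_accuracy(preds: List[List[bool]], answers: List[List[bool]]) -> float:
    """A question counts as correct only if every binary judgment is correct
    (a stricter reformulation of MCQ; assumed here, not confirmed by the source)."""
    correct = sum(p == a for p, a in zip(preds, answers))
    return 100.0 * correct / len(answers)

# Toy usage:
print(mcq_accuracy(["C", "A", "B"], ["C", "D", "B"]))   # ~66.7
print(mbin_accuracy([[True, False], [True, True]],
                    [[True, False], [False, True]]))     # 50.0
```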

🏅 VideoMathQA Leaderboard - Direct Answering 🎯

Accuracy on the VideoMathQA benchmark using the direct answer format (no CoT reasoning) for the MCQ and Multi-Binary (MBin) tasks, with and without subtitles; V denotes evaluation on the video alone and +Sub denotes evaluation with subtitles. The table reports performance across mathematical concepts and video durations. This leaderboard is sorted by results on Multi-Binary with subtitles.

| Models | Size | MCQ (V) | MCQ (+Sub) | MBin (V) | MBin (+Sub) | GAng | GAre | GLen | Chart | Stat | Arth | Topo | Grph | Cntg | Pzle | Short | Med | Long |
|--------|------|---------|------------|----------|-------------|------|------|------|-------|------|------|------|------|------|------|-------|-----|------|
| Claude-3.7-sonnet | - | 26.2 | 27.1 | 8.6 | 9.5 | 17.4 | 9.9 | 5.9 | 8.0 | 17.4 | 11.5 | 13.3 | 5.6 | 5.3 | 9.3 | 8.2 | 11.0 | 9.1 |
| GPT-4o | - | 20.2 | 24.5 | 12.6 | 13.6 | 13.0 | 12.7 | 15.7 | 12.0 | 4.4 | 17.3 | 20.0 | 5.6 | 7.9 | 20.4 | 14.2 | 15.6 | 10.6 |
| Gemini-2.0-Flash | - | 28.6 | 31.7 | 14.1 | 20.5 | 30.4 | 23.9 | 27.5 | 13.3 | 8.7 | 19.2 | 13.3 | 16.7 | 7.9 | 33.3 | 25.4 | 24.0 | 11.4 |
| Gemini-1.5-Flash | - | 20.5 | 23.1 | 12.6 | 17.6 | 26.1 | 15.5 | 19.6 | 9.3 | 17.4 | 23.1 | 6.7 | 22.2 | 15.8 | 24.1 | 17.9 | 22.1 | 12.1 |
| Qwen2.5-VL | 3B | 26.9 | 27.6 | 19.3 | 19.6 | 26.1 | 23.9 | 23.5 | 21.3 | 34.8 | 17.3 | 26.7 | 11.1 | 15.8 | 20.4 | 25.4 | 23.4 | 15.9 |
| InternVL2.5 | 2B | 24.3 | 20.7 | 14.3 | 14.5 | 21.7 | 9.9 | 27.5 | 10.7 | 4.4 | 15.4 | 20.0 | 0.0 | 15.8 | 16.7 | 17.9 | 16.9 | 8.3 |
| PLM-LLaMA | 3B | 22.9 | 22.1 | 13.6 | 15.0 | 17.4 | 16.9 | 25.5 | 8.0 | 26.1 | 9.6 | 20.0 | 11.1 | 13.2 | 13.0 | 16.4 | 18.8 | 9.1 |
| InternVL3 | 2B | 22.4 | 23.3 | 18.8 | 16.4 | 21.7 | 16.9 | 17.7 | 17.3 | 30.4 | 15.4 | 20.0 | 22.2 | 13.2 | 5.6 | 18.7 | 14.9 | 15.9 |
| PLM-LLaMA | 8B | 22.1 | 23.1 | 16.7 | 14.5 | 13.0 | 11.3 | 17.7 | 13.3 | 17.4 | 17.3 | 20.0 | 11.1 | 10.5 | 16.7 | 16.4 | 14.9 | 12.1 |
| Oryx-1.5 | 7B | 22.6 | 22.6 | 16.9 | 17.4 | 13.0 | 23.9 | 23.5 | 9.3 | 21.7 | 23.1 | 20.0 | 5.6 | 18.4 | 11.1 | 20.2 | 20.8 | 10.6 |
| LLaVA-OV | 7B | 20.7 | 21.2 | 14.8 | 15.5 | 8.7 | 15.5 | 17.7 | 16.0 | 30.4 | 17.3 | 13.3 | 5.6 | 15.8 | 11.1 | 16.4 | 18.8 | 10.6 |
| LongVA-DPO | 7B | 21.4 | 21.7 | 16.2 | 14.1 | 8.7 | 15.5 | 17.7 | 12.0 | 30.4 | 9.6 | 6.7 | 5.6 | 10.5 | 18.5 | 14.9 | 11.7 | 15.9 |
| Video-R1 | 7B | 21.4 | 17.4 | 16.0 | 16.2 | 8.7 | 22.5 | 25.5 | 16.0 | 26.1 | 13.5 | 6.7 | 5.6 | 13.2 | 9.3 | 16.4 | 16.9 | 15.2 |
| InternVL2.5 | 8B | 24.3 | 24.8 | 18.6 | 18.6 | 26.1 | 19.7 | 17.7 | 17.3 | 21.7 | 19.2 | 26.7 | 11.1 | 10.5 | 20.4 | 17.9 | 22.7 | 14.4 |
| LLaVA-Video | 7B | 26.9 | 26.4 | 20.0 | 19.3 | 13.0 | 21.1 | 31.4 | 17.3 | 17.4 | 15.4 | 26.7 | 5.6 | 18.4 | 18.5 | 23.9 | 20.8 | 12.9 |
| InternVideo2.5 | 8B | 25.2 | 28.6 | 19.1 | 19.1 | 34.8 | 22.5 | 15.7 | 14.7 | 21.7 | 19.2 | 20.0 | 27.8 | 10.5 | 18.5 | 18.7 | 22.1 | 15.9 |
| Qwen2.5-VL | 7B | 26.7 | 27.9 | 19.8 | 19.1 | 8.7 | 25.4 | 25.5 | 18.7 | 13.0 | 23.1 | 13.3 | 5.6 | 15.8 | 16.7 | 22.4 | 19.5 | 15.2 |
| InternVL3 | 8B | 29.1 | 27.9 | 20.0 | 20.7 | 13.0 | 29.6 | 27.5 | 13.3 | 13.0 | 28.9 | 20.0 | 22.2 | 15.8 | 14.8 | 25.4 | 24.0 | 12.1 |
| VideoChat-R1 | 7B | 27.6 | 29.1 | 21.2 | 21.2 | 8.7 | 22.5 | 31.4 | 21.3 | 17.4 | 30.8 | 6.7 | 11.1 | 15.8 | 18.5 | 26.9 | 20.1 | 16.7 |
| Aria | 34B | 23.8 | 26.4 | 17.4 | 19.1 | 8.7 | 25.4 | 19.6 | 22.7 | 17.4 | 19.2 | 20.0 | 11.1 | 21.1 | 11.1 | 21.6 | 16.9 | 18.9 |
| Oryx-1.5 | 32B | 30.5 | 33.1 | 22.9 | 24.1 | 30.4 | 39.4 | 31.4 | 10.7 | 17.4 | 21.2 | 6.7 | 11.1 | 15.8 | 33.3 | 27.6 | 29.9 | 13.6 |
| Qwen2.5-VL | 32B | 32.4 | 32.6 | 25.7 | 24.8 | 43.5 | 31.0 | 25.5 | 14.7 | 26.1 | 26.9 | 6.7 | 27.8 | 10.5 | 33.3 | 28.4 | 30.5 | 14.4 |
| InternVL2.5 | 38B | 31.0 | 33.6 | 24.1 | 26.0 | 43.5 | 38.0 | 39.2 | 8.0 | 13.0 | 32.7 | 6.7 | 11.1 | 18.4 | 29.6 | 34.3 | 31.8 | 10.6 |
| InternVL3 | 38B | 31.7 | 35.7 | 25.2 | 29.5 | 34.8 | 42.3 | 37.3 | 13.3 | 17.4 | 25.0 | 13.3 | 33.3 | 26.3 | 40.7 | 35.8 | 38.3 | 12.9 |
| LLaVA-Video | 72B | 28.3 | 30.0 | 20.2 | 24.3 | 8.7 | 32.4 | 25.5 | 20.0 | 13.0 | 36.5 | 13.3 | 22.2 | 21.1 | 24.1 | 27.6 | 27.3 | 17.4 |
| LLaVA-OV | 72B | 25.5 | 28.3 | 21.0 | 24.8 | 17.4 | 31.0 | 23.5 | 12.0 | 21.7 | 38.5 | 20.0 | 27.8 | 18.4 | 31.5 | 30.6 | 28.6 | 14.4 |
| InternVL2.5 | 78B | 33.3 | 31.7 | 28.3 | 27.9 | 39.1 | 36.6 | 31.4 | 18.7 | 26.1 | 32.7 | 26.7 | 27.8 | 13.2 | 27.8 | 33.6 | 35.1 | 13.6 |
| Qwen2.5-VL | 72B | 36.9 | 37.6 | 26.0 | 27.9 | 26.1 | 36.6 | 31.4 | 17.3 | 30.4 | 38.5 | 20.0 | 16.7 | 18.4 | 29.6 | 34.3 | 29.2 | 19.7 |
| InternVL3 | 78B | 33.3 | 31.7 | 28.3 | 27.9 | 39.1 | 36.6 | 31.4 | 18.7 | 26.1 | 32.7 | 26.7 | 27.8 | 13.2 | 27.8 | 33.6 | 35.1 | 13.6 |

Overview and Analysis of VideoMathQA

Examples from the Benchmark

Figure 1

Example questions from VideoMathQA illustrating the three reasoning types: Problem Focused, Concept Transfer, and Deep Comprehension. Each example includes evolving dynamics in the video, a complex text prompt, five multiple-choice options, expert-annotated step-by-step reasoning for solving the given problem, and the final correct answer.

Overview of VideoMathQA

Figure 1

The figure illustrates: a) the distribution of questions and model performance across the ten mathematical concepts in VideoMathQA, where the consistently low performance across all concepts reveals a significant gap in the ability of current multimodal models to perform mathematical reasoning over videos; b) the distribution of video durations in VideoMathQA, highlighting a diverse range from short clips of 10 seconds to long videos of up to 1 hour; c) the three-stage annotation pipeline for VideoMathQA, performed by expert science graduates who annotated detailed step-by-step reasoning trails, with each stage governed by strict quality assessment.

Effect of Video Length, Subtitles, and Frame Count on Multimodal Reasoning

Figure 2

The figure illustrates VideoMathQA performance (a) across video duration categories, (b) with and without subtitles, and (c) with varying numbers of input frames. Models perform best on medium-length videos, and accuracy improves with the inclusion of subtitles and with more frames during evaluation.
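For context on the frame-count ablation in (c), the snippet below shows a generic way to sample a fixed budget of frames uniformly from a video with OpenCV. It is a sketch under common assumptions, not the benchmark's official preprocessing.

```python
# Generic uniform frame sampling, as commonly used when varying the frame budget.
# Not the benchmark's official preprocessing; shown only to make the ablation concrete.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32):
    """Return up to `num_frames` frames sampled uniformly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# e.g. compare a model fed 8, 16, or 32 frames per video:
# for budget in (8, 16, 32):
#     frames = sample_frames("lecture_0001.mp4", budget)
```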

Understanding Model Limitations in VideoMathQA Reasoning

Figure 3

The figure shows a) a comparison among vision-blind, image-only, and video models, highlighting the need for video-level understanding to perform well on VideoMathQA; b) the distribution of questions in VideoMathQA across three difficulty levels for varying reasoning depths, and the relationship between performance and question difficulty for the top-performing models; c) an error analysis based on CoT step evaluation. Most model errors stem from misunderstanding the question, where models misinterpret what the question asks or overlook critical multimodal cues.
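As a loose sketch of what step-level CoT evaluation could look like, the function below checks each expert-annotated reference step against a model's reasoning with a pluggable judge and scales the coverage to a 0-10 score, mirroring the CoT step score column in the leaderboard. The protocol, the judge, and the scaling are assumptions for illustration, not the benchmark's actual metric.

```python
# Hypothetical step-level CoT evaluation sketch. The benchmark's real metric may
# differ; this only illustrates the idea of grading reasoning step by step.
from typing import Callable, List

def cot_step_score(model_reasoning: str,
                   reference_steps: List[str],
                   judge: Callable[[str, str], bool]) -> float:
    """`judge(step, reasoning)` decides whether one reference step is covered."""
    covered = sum(judge(step, model_reasoning) for step in reference_steps)
    return 10.0 * covered / len(reference_steps)

# A trivial stand-in judge based on keyword overlap; a real setup would more
# plausibly use an LLM or human grader.
def naive_judge(step: str, reasoning: str) -> bool:
    keywords = [w for w in step.lower().split() if len(w) > 4]
    return bool(keywords) and any(w in reasoning.lower() for w in keywords)
```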
