CVPR 2026 · arXiv:2603.22918

EVA: Efficient Reinforcement Learning
for End-to-End Video Agent

Yaolun Zhang*, Ruohui Wang*†, Jiahao Wang*‡, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

SenseTime Research

*Equal contribution   †Corresponding author   ‡Project lead

TL;DR

EVA transforms video MLLMs from passive frame-consumers into active planners — deciding what, when, and how to watch before touching any pixels. Result: 6–12% accuracy gain with 30% fewer visual tokens across six video understanding benchmarks.

Motivation

Existing video understanding approaches suffer from two fundamental limitations:

Passive MLLM

Processes all uniformly sampled frames at once. A single relevant frame is buried among hundreds of irrelevant ones — leading to visual misguidance and wrong answers.

✗ No selective attention

Traditional Agent

Adds tool calls, but remains perception-first: the agent sees uniform frames before reasoning. Fixed FPS and resolution prevent adaptive information gathering.

✗ Rigid, manual workflow

EVA (Ours)

Plans before perceiving: reads only the question, forms a strategy, then selects exactly which frames to fetch at what resolution — using 30% fewer visual tokens.

✓ Planning-before-perception

Figure 1. Three paradigms for video understanding. Passive MLLM uses uniform sampling. Traditional Agent adds tool calls but starts from frames. EVA plans from the query, then selects frames strategically.

Why Planning-Before-Perception Works

Avoids Visual Misguidance

Uniformly sampled frames often contain irrelevant actions that mislead the model. EVA establishes intent textually before any visual interaction, preventing distraction from noisy frames.

Saves Visual Tokens

By identifying which segments matter upfront, EVA fetches only necessary frames at optimal resolution — reducing visual token consumption by up to 30% compared to Qwen2.5-VL.

Active, Goal-Driven Perception

The iterative Summary → Plan → Action → Reflection loop enables the agent to autonomously adapt its strategy based on what it sees — like a detective gathering evidence.

Flexible Tool Interface

EVA's tools expose nframes, resize, start_time, and end_time — giving the agent full control over perception granularity. Traditional agents use fixed FPS.
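As a concrete sketch of what this interface could look like (field names follow the tool-call examples on this page; the dataclass, defaults, and `fps` helper are our own illustrative assumptions, not EVA's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class WatchVideoCall:
    """Hypothetical schema for EVA's frame-fetching tool.

    Exposes all four perception knobs; defaults are illustrative only.
    """
    start_time: float    # segment start, in seconds
    end_time: float      # segment end, in seconds
    nframes: int = 32    # frames to sample within [start_time, end_time]
    resize: float = 1.0  # resolution scale: 0.1 = coarse survey, 1.0 = full detail

    def fps(self) -> float:
        """Effective sampling rate implied by this call. Unlike a fixed-FPS
        agent, the rate falls out of the agent's own choice of window and
        frame count."""
        return self.nframes / (self.end_time - self.start_time)

# A coarse full-video survey vs. a targeted high-resolution zoom:
survey = WatchVideoCall(start_time=0, end_time=6630, nframes=256, resize=0.1)
zoom = WatchVideoCall(start_time=3500, end_time=4000, nframes=50, resize=1.0)
```

Note how the same tool covers both extremes: the survey call implies a very low effective FPS over the whole video, while the zoom call samples one short window densely.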

Method

EVA uses a three-stage training pipeline to learn efficient video reasoning end-to-end.

Iterative Reasoning Loop

Summary
Summarize current observations and known facts
Plan
Decide what to look at next and why
Action
Issue tool call with precise params
Reflection
Evaluate evidence; repeat or answer
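The loop above can be sketched as plain control flow. This is a minimal illustration: `model`, `video`, and the four helper methods are hypothetical stand-ins for the MLLM's structured output segments, not EVA's real API:

```python
def eva_loop(question, video, model, max_rounds=8):
    """Sketch of the Summary -> Plan -> Action -> Reflection loop."""
    observations = []  # frames fetched so far; empty at planning time
    for _ in range(max_rounds):
        summary = model.summarize(question, observations)  # known facts so far
        plan = model.plan(summary)                         # what to look at next, and why
        call = model.action(plan)                          # tool call with precise params
        observations.append(video.watch(**call))           # fetch the requested frames
        verdict = model.reflect(question, observations)    # enough evidence yet?
        if verdict.confident:
            return verdict.answer
    # Budget exhausted: answer with whatever evidence was gathered.
    return model.answer(question, observations)
```

The key property is that the first `summarize`/`plan` round runs before any `video.watch` call, so the strategy is formed from the question alone.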

Three-Stage Training Pipeline

01
SFT
Supervised Fine-Tuning
10K samples from LLaVA-Video + CG-Bench.
Teaches tool-call formatting and basic frame-selection strategies via cold-start training.
Cold Start
02
KTO
Kahneman–Tversky Optimization
11K labeled trajectories (63% correct / 37% failure).
Corrects over/under-sampling failures from SFT without needing paired preferences.
Preference Correction
03
GRPO
Group Relative Policy Optimization
9.6K open-ended + 1.1K MC QA pairs.
Online RL with accuracy + format rewards. Data-enhanced pipeline mines hard cases.
Online RL
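To make stage 03 concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name. The reward weights and the exact accuracy-plus-format split are illustrative assumptions, not the paper's tuned values:

```python
def grpo_advantages(rewards):
    """Group Relative Policy Optimization advantage: normalize each
    trajectory's reward against the group of rollouts sampled for the
    same question, so no learned value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def trajectory_reward(answer_correct, format_ok, w_acc=1.0, w_fmt=0.5):
    """Illustrative reward: an accuracy term plus a format term (did the
    rollout keep the Summary/Plan/Action/Reflection structure and emit
    well-formed tool calls?)."""
    return w_acc * float(answer_correct) + w_fmt * float(format_ok)

# Example: 4 rollouts for one question; two correct, one malformed.
rewards = [trajectory_reward(c, f) for c, f in
           [(True, True), (False, True), (True, True), (False, False)]]
```

Correct, well-formatted rollouts get positive advantages and are reinforced; the malformed failure gets the most negative advantage within its group.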

How EVA Works

EVA autonomously decides what, when, and how to watch — through an iterative plan–act–reflect loop. The animation below shows a real example.

Input Question

"What sequence of actions did the camera wearer perform while adjusting and using a microscope for sample analysis?"

A: Place petri dish → add solution → adjust microscope → remove gloves
B: Place petri dish → adjust eyepieces → sanitize hands → operate microscope → store dish → retrieve another → remove gloves
C: Sanitize hands → place slide → capture images → adjust microscope
D: Clean microscope → place slide → adjust knobs → record data
0 frames loaded · 0 visual tokens
EVA Reasoning
🧠 Planning Strategy

"I need to identify a microscope-related action sequence. Rather than loading all frames at once, let me first survey the entire video at low resolution to locate the relevant segment efficiently."

"I'll sample 256 frames across the full video at 0.1× resolution — minimal tokens, maximum temporal coverage."

Plans without seeing any frames yet — pure text reasoning
Tool Call
watch_video({
  start_time: 0,
  end_time:   6630,
  nframes:    256,
  resize:     0.1     // ← low resolution
})

"Surveying the full video at low resolution. Looking for microscope-related activity..."

256 frames fetched · ~10K visual tokens
Reflection
🔍 Window Identified

"I can see frames around t = 3500–4000s contain relevant microscope activity. The rest of the video is background or unrelated actions."

"I'll now zoom into this specific window at full resolution to identify the exact action sequence."

Timeline: 0s ──[3500–4000s identified]── 6630s
256 frames used · segment found
Tool Call
watch_video({
  start_time: 3500,
  end_time:   4000,
  nframes:    50,
  resize:     1.0     // ← full resolution
})

"Zooming into t=3500–4000s at full resolution to identify each action in the sequence..."

50 targeted frames · ~14K total tokens used
Correct Answer
Option B

Place petri dish → adjust eyepieces → sanitize hands → operate microscope → store dish → retrieve another → remove gloves

EVA (ours): ~14K tokens
Qwen2.5-VL: ~20K tokens
Traditional Agent: ~22K tokens · Wrong
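The token arithmetic behind this comparison can be reproduced with a back-of-the-envelope sketch. The per-frame costs below are our own assumptions, reverse-engineered to match the rounded counts in the walkthrough; they are not measured values:

```python
def visual_tokens(nframes, tokens_per_frame):
    """Visual-token cost of one watch_video call."""
    return nframes * tokens_per_frame

# Assumed per-frame costs, fitted to the rounded totals above:
survey = visual_tokens(256, 39)     # 0.1x survey of the full video -> 9,984 (~10K)
zoom = visual_tokens(50, 80)        # 1.0x zoom on the 3500-4000s window -> 4,000
eva_total = survey + zoom           # -> 13,984 (~14K)

baseline = visual_tokens(256, 78)   # uniform fixed-resolution sampling -> 19,968 (~20K)
savings = 1 - eva_total / baseline  # -> ~0.30, i.e. roughly 30% fewer visual tokens
```

The point of the exercise: a cheap low-resolution survey plus one targeted full-resolution pass undercuts a single uniform high-resolution pass, which is where the ~30% token saving comes from.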

Evaluation Results

EVA-GRPO sets the best open-source score on every benchmark, while using far fewer frames than fixed-sampling baselines.

Long Video Understanding Benchmarks

| Model | LongVideoBench (frames · acc) | MLVU (frames · acc) | VideoMME (frames · Long / Overall) | LVBench (frames · acc) |
|---|---|---|---|---|
| **Closed-Source Models** | | | | |
| GPT-4o | 32 · 58.2 | 0.5 fps · 64.6 | 384 · 65.3 / 71.9 | 60 · 48.9 |
| Gemini-1.5-Pro | 32 · 55.2 | — | 0.5 fps · 67.4 / 75.0 | 3600 · 33.1 |
| **Static Frame Sampling** | | | | |
| ShareGPT4Video | 16 · 39.7 | 16 · 46.4 | 16 · 35.0 / 39.9 | — |
| LongVA | — | 256 · 56.3 | 128 · 46.2 / 52.6 | — |
| VITA-1.5-7B | — | — | 16 · 47.1 / 56.1 | — |
| Video-R1 | 32 · 52.7 | 32 · 60.2 | 32 · 49.4 / 59.9 | 32 · 35.3 |
| VideoChat-R1 | 32 · 49.1 | 32 · 54.3 | 32 · 46.2 / — | 32 · 34.3 |
| Qwen2.5-VL | 32 · 43.2 | 32 · 48.4 | 32 · 44.7 / 53.6 | 32 · 31.6 |
| **Adaptive Agent** | | | | |
| VideoAgent | — | — | 87 · 49.0 / 56.0 | 25.5 · 29.3 |
| FrameThinker | 21.1 · 52.9 | 23.2 · 59.1 | 24.1 · 47.6 / — | 23.9 · 36.6 |
| VideoMTR | — | — | 32 · 51.0 / 59.0 | — |
| **Ours** | | | | |
| EVA-SFT | 33.8* · 49.9 | 46.7* · 52.3 | 26.6* · 45.8 / 56.0 | 56.2* · 26.5 |
| EVA-KTO | 35.6* · 53.2 | 28.7* · 57.4 | 24.1* · 45.1 / 56.5 | 34.5* · 36.0 |
| EVA-GRPO | 25.3* · **55.0** | 22.2* · **68.3** | 22.8* · 48.4 / **60.2** | 26.8* · **43.3** |

* Adaptive frame count — values are per-video averages. Bold = best open-source result per benchmark.

Video-Holmes Zero-Shot Benchmark

| Model | Frames | SR | IMC | TCI | TA | MHR | PAR | CTI | Overall |
|---|---|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | | | |
| GPT-4o | 32 | 50.0 | 49.6 | 38.8 | 30.0 | 44.0 | 39.2 | 37.0 | 42.0 |
| Gemini-2.0-Flash | — | 41.8 | 33.7 | 23.1 | 20.5 | 30.1 | 26.8 | 33.7 | 30.6 |
| **Open-Source Models** | | | | | | | | | |
| InternVL2.5-8B | 32 | 28.0 | 32.2 | 21.5 | 7.7 | 25.7 | 23.8 | 22.6 | 23.8 |
| InternVL3-8B | 32 | 29.5 | 40.7 | **37.9** | 35.1 | 24.6 | **38.9** | 24.1 | 32.3 |
| Qwen2.5-VL-7B | 32 | 38.4 | 34.8 | 17.6 | 30.0 | 27.1 | 18.6 | 25.2 | 27.8 |
| SEED-Bench-R1 | 32 | 42.8 | 35.1 | 25.6 | 40.5 | 29.2 | 29.6 | 32.9 | 33.5 |
| VideoChat-R1 | 32 | 42.1 | 38.8 | 24.5 | 39.5 | 29.5 | 27.8 | 29.5 | 33.0 |
| Video-R1 | 32 | 48.6 | **41.7** | 28.9 | 34.5 | **31.0** | 33.6 | **35.6** | 36.5 |
| **Ours** | | | | | | | | | |
| EVA-SFT | 11.5* | 44.5 | 33.7 | 26.4 | 39.5 | 23.2 | 31.9 | 32.2 | 32.6 |
| EVA-KTO | 5.8* | 42.8 | 36.2 | 22.7 | 39.5 | 22.9 | 32.0 | 31.2 | 32.9 |
| EVA-GRPO | 36.8* | **49.3** | 39.5 | 30.4 | **44.5** | 27.1 | 37.6 | 35.2 | **37.2** |

* Adaptive frame count — values are per-video averages. Bold = best open-source result per column.

BibTeX

@article{zhang2026eva,
  title   = {EVA: Efficient Reinforcement Learning for End-to-End Video Agent},
  author  = {Zhang, Yaolun and Wang, Ruohui and Wang, Jiahao and Tang, Yepeng
             and Zheng, Xuanyu and Duan, Haonan and Lu, Hao and Deng, Hanming
             and Lu, Lewei},
  journal = {arXiv preprint arXiv:2603.22918},
  year    = {2026}
}