VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects
¹Texas A&M University · ²Visko Platform · ³Abaka AI · ⁴UW-Madison
VEFX-Reward scores on the 1–4 scale, ranked by GeoAgg, a weighted aggregate over the three axes (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better; a sketch of the assumed GeoAgg computation follows the table.
Updated: May 2, 2026
| # | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|---|---|---|---|---|---|---|
| 1 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057 |
| 2 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985 |
| 3 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |
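The page does not spell out the GeoAgg formula. The sketch below assumes it is the (α, β, γ)-weighted geometric mean of the three axis scores, computed per example and then averaged over a model's outputs; averaging after the geometric mean would also explain why a row's GeoAgg can sit below the geometric mean of its row-level IF/RQ/EE averages. The function name `geo_agg` and the per-example protocol are our assumptions, not the paper's stated method.

```python
import math

def geo_agg(if_score: float, rq_score: float, ee_score: float,
            alpha: float = 2.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Assumed GeoAgg: (IF^alpha * RQ^beta * EE^gamma)^(1 / (alpha+beta+gamma)).

    Computed in log space for numerical stability; scores are on the 1-4 scale,
    so all logs are well-defined.
    """
    total = alpha + beta + gamma
    log_mean = (alpha * math.log(if_score)
                + beta * math.log(rq_score)
                + gamma * math.log(ee_score)) / total
    return math.exp(log_mean)

# One hypothetical example rated IF=3, RQ=4, EE=2 on the 1-4 scale:
print(round(geo_agg(3.0, 4.0, 2.0), 3))  # 2.913
```

With α=2, the aggregate penalizes instruction-following failures twice as heavily (in log space) as quality or preservation failures, which matches the ranking emphasis described above.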
[Video gallery] Real editing prompts, real model outputs, real human scores: each row shows the original source video alongside edits from different models.
Video editing models can now follow complex instructions — zooming, removing objects, changing styles, adding elements. But how do we know which edits are actually good?
VEFX-Dataset provides 5,049 human-annotated video editing examples from 1,419 source videos across 9 categories and 32 subcategories, scored along three decoupled dimensions. We further release VEFX-Reward, a dedicated reward model, and VEFX-Bench, a benchmark of 300 curated pairs for standardized comparison.
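As a quick sanity check of those composition numbers, a minimal sketch is below. It assumes a hypothetical JSONL release where each record carries `category`, `subcategory`, and `source_video` fields; neither the file name nor the field names are confirmed by the release.

```python
import json
from collections import Counter

# Hypothetical file and field names; adjust to the actual released schema.
with open("vefx_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(len(examples))                                        # expect 5,049 annotated examples
print(len({ex["source_video"] for ex in examples}))         # expect 1,419 unique source videos
print(len(Counter(ex["category"] for ex in examples)))      # expect 9 categories
print(len(Counter(ex["subcategory"] for ex in examples)))   # expect 32 subcategories
```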
Each edited video is scored on a 1–4 scale across three independent axes, capturing distinct aspects of quality.
IF: Does the edit faithfully execute the given instruction? Measures semantic accuracy, i.e., whether the intended change actually happened.
RQ: Is the output visually clean and artifact-free? Evaluates temporal consistency, spatial fidelity, and overall visual quality of the edited video.
EE: Are unrelated regions preserved? Captures whether the model changed only what was asked, without unintended side effects elsewhere.
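To make the decoupling concrete, here is a minimal sketch of what a per-example annotation record could look like; the class and field names are illustrative, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class VEFXAnnotation:
    """One human rating of an edited video (field names are hypothetical)."""
    instruction: str   # the editing instruction given to the model
    if_score: float    # IF: did the intended change actually happen? (1-4)
    rq_score: float    # RQ: visual cleanliness, temporal consistency (1-4)
    ee_score: float    # EE: were unrelated regions left untouched? (1-4)

    def __post_init__(self) -> None:
        # Enforce the 1-4 rating scale on all three axes.
        for name in ("if_score", "rq_score", "ee_score"):
            value = getattr(self, name)
            if not 1.0 <= value <= 4.0:
                raise ValueError(f"{name}={value} is outside the 1-4 scale")
```

Keeping the three scores separate, rather than collapsing them into a single overall rating, is what lets the benchmark distinguish an edit that follows the instruction but corrupts the background (high IF, low EE) from one that changes nothing at all (low IF, high EE).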
Kling o3 Omni (3.057) and Kling o1 (2.985) top the leaderboard. UniVideo is the strongest open-source model (2.516), outperforming two commercial systems, Wan 2.6 and Luma Ray 2.
RQ scores are consistently higher than IF across all models — producing visually plausible edits is far easier than faithfully following editing instructions.
EE shows the widest spread (1.180–3.376), confirming that over-editing and unintended scene changes remain a major failure mode in current systems like VACE and Luma Ray 2.
@article{gao2025vefxbench,
title={VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects},
author={Xiangbo Gao and Sicong Jiang and Bangya Liu and Xinghao Chen and Minglai Yang and Siyuan Yang and Mingyang Wu and Jiongze Yu and Qi Zheng and Haozhi Wang and Jiayi Zhang and Jared Yang and Jie Yang and Zihan Wang and Qing Yin and Zhengzhong Tu},
journal={arXiv preprint arXiv:2604.16272},
year={2026}
}