VEFX-Bench

Benchmarking Generic Video Editing
and Visual Effects

Xiangbo Gao1,2 · Sicong Jiang3 · Bangya Liu3 · Xinghao Chen1 · Minglai Yang3 · Siyuan Yang1 · Mingyang Wu1 · Jiongze Yu1 · Qi Zheng2 · Haozhi Wang2 · Jiayi Zhang4 · Jared Yang2 · Jie Yang2 · Zihan Wang3 · Qing Yin2 · Zhengzhong Tu1,2

1Texas A&M University   2Visko Platform   3Abaka AI   4UW-Madison

Model Leaderboard

VEFX-Reward scores on the 1–4 scale. Models are ranked by GeoAgg, a weighted geometric aggregate of the three dimensions (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better.

Updated: May 2, 2026

For the latest results and submissions, visit the live leaderboard: vefx-leaderboard.com
Rank | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑
1 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057
2 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985
3 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912
4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766
5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723
6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717
7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516
8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146
9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804
10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775

See the benchmark in action.

Real editing prompts, real model outputs, real human scores. Each row shows the original video alongside edits from different models.

Editing Instruction (Attribute Change): Change the color of the red industrial trailer to a bright yellow while maintaining the texture and appearance of the metal surface.
ORIGINAL Source
Kling
IF: 4 RQ: 4 EE: 4
Grok
IF: 4 RQ: 4 EE: 3
Wan 2.6
IF: 4 RQ: 4 EE: 1
UniVideo
IF: 1 RQ: 1 EE: 1
Editing Instruction (Object Removal): Remove the woman with the grey backpack walking on the right side of the frame.
ORIGINAL Source
Grok
IF: 4 RQ: 4 EE: 4
Kling
IF: 4 RQ: 4 EE: 4
Wan 2.6
IF: 1 RQ: 4 EE: 1
UniVideo
IF: 2 RQ: 1 EE: 1
Editing Instruction (Style Transfer): Restore the natural, realistic colors to the entire scene, replacing the current black-and-white style with a full-color rendition.
ORIGINAL Source
Kling
IF: 4 RQ: 4 EE: 1
Wan 2.6
IF: 4 RQ: 4 EE: 3
Grok
IF: 3 RQ: 3 EE: 4
UniVideo
IF: 1 RQ: 1 EE: 1
Editing Instruction (Camera Motion): Perform a smooth zoom in on the distant snowy mountain peaks to create a more immersive view.
ORIGINAL Source
Kling
IF: 4 RQ: 4 EE: 3
VACE
IF: 2 RQ: 1 EE: 2
Grok
IF: 1 RQ: 4 EE: 4
Wan 2.6
IF: 1 RQ: 4 EE: 1
Editing Instruction (Animate): Animate the hedgehog to crawl slowly along the tree branch towards the woman with blonde hair.
ORIGINAL Source
Wan 2.6
IF: 4 RQ: 4 EE: 4
Grok
IF: 1 RQ: 3 EE: 3
Kling
IF: 1 RQ: 3 EE: 3
Luma Ray 2
IF: 1 RQ: 3 EE: 3
Editing Instruction (Background Replace): Replace the entire urban background visible through the black hexagonal mesh with a vibrant, neon-lit cyberpunk city at night.
ORIGINAL Source
Grok
IF: 4 RQ: 4 EE: 4
Kling
IF: 2 RQ: 2 EE: 1
Wan 2.6
IF: 1 RQ: 2 EE: 1
UniVideo
IF: 1 RQ: 1 EE: 1

Editing video is easy.
Measuring quality is hard.

Video editing models can now follow complex instructions — zooming, removing objects, changing styles, adding elements. But how do we know which edits are actually good?

VEFX-Dataset provides 5,049 human-annotated video editing examples from 1,419 source videos across 9 categories and 32 subcategories, scored along three decoupled dimensions. We further release VEFX-Reward, a dedicated reward model, and VEFX-Bench, a benchmark of 300 curated pairs for standardized comparison.

Three dimensions of editing quality.

Each edited video is scored on a 1–4 scale along three independent axes, each capturing a distinct aspect of quality.

🎯
IF

Instruction Following

Does the edit faithfully execute the given instruction? Measures semantic accuracy — whether the intended change actually happened.

🎨
RQ

Rendering Quality

Is the output visually clean and artifact-free? Evaluates temporal consistency, spatial fidelity, and overall visual quality of the edited video.

🔒
EE

Edit Exclusivity

Are unrelated regions preserved? Captures whether the model only changed what was asked — without unintended side effects elsewhere.
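The leaderboard's GeoAgg column combines these three per-dimension scores. A minimal sketch of how such a weighted geometric aggregate can be computed, assuming the weights from the leaderboard caption (α=2 for IF, β=1 for RQ, γ=1 for EE); the exact averaging order used in the benchmark (per-example vs. per-model) is not specified here, so this `geo_agg` helper is illustrative only:

```python
def geo_agg(if_score: float, rq_score: float, ee_score: float,
            alpha: float = 2.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Weighted geometric mean of the three dimension scores (1-4 scale).

    Hypothetical sketch of the GeoAgg aggregate: IF is weighted twice as
    heavily as RQ and EE, so instruction-following failures are penalized
    more than rendering or exclusivity failures.
    """
    total = alpha + beta + gamma
    return (if_score**alpha * rq_score**beta * ee_score**gamma) ** (1.0 / total)

# A perfect edit aggregates to 4.0; a complete failure stays at 1.0.
print(round(geo_agg(4, 4, 4), 3))  # 4.0
print(round(geo_agg(1, 1, 1), 3))  # 1.0
```

Because the geometric mean multiplies rather than adds, a single low dimension drags the aggregate down sharply, so a model cannot hide a poor IF score behind high rendering quality.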

Benchmark at a glance.

5,049
Annotated Examples
1,419
Source Videos
9 / 32
Categories / Subcategories
10
Editing Systems
3
Quality Dimensions
300
Benchmark Pairs

What we found.

📊

Kling Leads the Pack

Kling o3 Omni (3.057) and Kling o1 (2.985) top the leaderboard. UniVideo is the strongest open-source model (2.516), outperforming several commercial systems.

🎯

Visual Quality ≠ Faithfulness

RQ scores are consistently higher than IF across all models — producing visually plausible edits is far easier than faithfully following editing instructions.

⚠️

Locality Is the Biggest Gap

EE shows the widest spread (1.180–3.376), confirming that over-editing and unintended scene changes remain a major failure mode in current systems like VACE and Luma Ray 2.

Citation

@article{gao2025vefxbench,
  title={VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects},
  author={Xiangbo Gao and Sicong Jiang and Bangya Liu and Xinghao Chen and Minglai Yang and Siyuan Yang and Mingyang Wu and Jiongze Yu and Qi Zheng and Haozhi Wang and Jiayi Zhang and Jared Yang and Jie Yang and Zihan Wang and Qing Yin and Zhengzhong Tu},
  journal={arXiv preprint arXiv:2604.16272},
  year={2026}
}