VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects
¹Texas A&M University · ²Visko Platform · ³Abaka AI · ⁴UW-Madison
VEFX-Reward scores on the 1–4 scale, ranked by GeoAgg, a weighted aggregate over the three axes (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better; a sketch of the assumed GeoAgg computation follows the table.
Updated: May 2, 2026
| # | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|---|---|---|---|---|---|---|
| 1 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057 |
| 2 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985 |
| 3 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |
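The page does not spell out the GeoAgg formula. The sketch below assumes it is the (α, β, γ)-weighted geometric mean of the three axis scores, computed per example and then averaged over a model's outputs; averaging after the geometric mean would also explain why a row's GeoAgg can sit below the geometric mean of its row-level IF/RQ/EE averages. The function name `geo_agg` and the per-example protocol are our assumptions, not the paper's stated method.

```python
import math

def geo_agg(if_score: float, rq_score: float, ee_score: float,
            alpha: float = 2.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Assumed GeoAgg: (IF^alpha * RQ^beta * EE^gamma)^(1 / (alpha+beta+gamma)).

    Computed in log space for numerical stability; scores are on the 1-4 scale,
    so all logs are well-defined.
    """
    total = alpha + beta + gamma
    log_mean = (alpha * math.log(if_score)
                + beta * math.log(rq_score)
                + gamma * math.log(ee_score)) / total
    return math.exp(log_mean)

# One hypothetical example rated IF=3, RQ=4, EE=2 on the 1-4 scale:
print(round(geo_agg(3.0, 4.0, 2.0), 3))  # 2.913
```

With α=2, the aggregate penalizes instruction-following failures twice as heavily (in log space) as quality or preservation failures, which matches the ranking emphasis described above.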
[Video gallery] Real editing prompts, real model outputs, real human scores: each row shows the original source video alongside edits from different models.
Video editing models can now follow complex instructions — zooming, removing objects, changing styles, adding elements. But how do we know which edits are actually good?
VEFX-Dataset provides 5,049 human-annotated video editing examples from 1,419 source videos across 9 categories and 32 subcategories, scored along three decoupled dimensions. We further release VEFX-Reward, a dedicated reward model, and VEFX-Bench, a benchmark of 300 curated pairs for standardized comparison.
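As a quick sanity check of those composition numbers, a minimal sketch is below. It assumes a hypothetical JSONL release where each record carries `category`, `subcategory`, and `source_video` fields; neither the file name nor the field names are confirmed by the release.

```python
import json
from collections import Counter

# Hypothetical file and field names; adjust to the actual released schema.
with open("vefx_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(len(examples))                                        # expect 5,049 annotated examples
print(len({ex["source_video"] for ex in examples}))         # expect 1,419 unique source videos
print(len(Counter(ex["category"] for ex in examples)))      # expect 9 categories
print(len(Counter(ex["subcategory"] for ex in examples)))   # expect 32 subcategories
```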
Each edited video is scored on a 1–4 scale across three independent axes, capturing distinct aspects of quality.
IF: Does the edit faithfully execute the given instruction? Measures semantic accuracy, i.e., whether the intended change actually happened.
RQ: Is the output visually clean and artifact-free? Evaluates temporal consistency, spatial fidelity, and overall visual quality of the edited video.
EE: Are unrelated regions preserved? Captures whether the model changed only what was asked, without unintended side effects elsewhere.
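To make the decoupling concrete, here is a minimal sketch of what a per-example annotation record could look like; the class and field names are illustrative, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class VEFXAnnotation:
    """One human rating of an edited video (field names are hypothetical)."""
    instruction: str   # the editing instruction given to the model
    if_score: float    # IF: did the intended change actually happen? (1-4)
    rq_score: float    # RQ: visual cleanliness, temporal consistency (1-4)
    ee_score: float    # EE: were unrelated regions left untouched? (1-4)

    def __post_init__(self) -> None:
        # Enforce the 1-4 rating scale on all three axes.
        for name in ("if_score", "rq_score", "ee_score"):
            value = getattr(self, name)
            if not 1.0 <= value <= 4.0:
                raise ValueError(f"{name}={value} is outside the 1-4 scale")
```

Keeping the three scores separate, rather than collapsing them into a single overall rating, is what lets the benchmark distinguish an edit that follows the instruction but corrupts the background (high IF, low EE) from one that changes nothing at all (low IF, high EE).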
Kling o3 Omni (3.057) and Kling o1 (2.985) top the leaderboard. UniVideo is the strongest open-source model (2.516), outperforming two commercial systems, Wan 2.6 and Luma Ray 2.
RQ scores are consistently higher than IF across all models — producing visually plausible edits is far easier than faithfully following editing instructions.
EE shows the widest spread (1.180–3.376), confirming that over-editing and unintended scene changes remain a major failure mode in current systems like VACE and Luma Ray 2.
@article{gao2025vefxbench,
title={VEFX-Bench: Benchmarking Generic Video Editing and Visual Effects},
author={Xiangbo Gao and Sicong Jiang and Bangya Liu and Xinghao Chen and Minglai Yang and Siyuan Yang and Mingyang Wu and Jiongze Yu and Qi Zheng and Haozhi Wang and Jiayi Zhang and Jared Yang and Jie Yang and Zihan Wang and Qing Yin and Zhengzhong Tu},
journal={arXiv preprint arXiv:2604.16272},
year={2026}
}