Benchmarking Generic Video Editing and Visual Effects
¹Texas A&M University, ²Visko Platform, ³Abaka AI, ⁴UW-Madison
VEFX-Reward scores on the 1–4 scale. Ranked by GeoAgg (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better.
Updated: May 2, 2026
| Rank | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|---|---|---|---|---|---|---|
| 1 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057 |
| 2 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985 |
| 3 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |
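The leaderboard gives the GeoAgg weights (α=2, β=1, γ=1) but not the formula itself. A minimal sketch, assuming GeoAgg is a weighted geometric mean of the three axis scores; the functional form, the function names `geo_agg` and `mean_geo_agg`, and the sample values are all assumptions made for illustration, not the released implementation.

```python
import math


def geo_agg(if_score, rq_score, ee_score, alpha=2.0, beta=1.0, gamma=1.0):
    """Weighted geometric mean of the three axis scores.

    ASSUMPTION: this functional form is a guess at what GeoAgg means;
    only the weights (alpha=2, beta=1, gamma=1) come from the leaderboard.
    """
    for score in (if_score, rq_score, ee_score):
        if not 1.0 <= score <= 4.0:
            raise ValueError(f"score {score} is outside the 1-4 scale")
    total_weight = alpha + beta + gamma
    # Work in log space: exp(sum(w_i * log(x_i)) / sum(w_i)).
    log_mean = (
        alpha * math.log(if_score)
        + beta * math.log(rq_score)
        + gamma * math.log(ee_score)
    ) / total_weight
    return math.exp(log_mean)


def mean_geo_agg(samples):
    """Average the per-sample aggregate over (IF, RQ, EE) triples."""
    return sum(geo_agg(*sample) for sample in samples) / len(samples)


# Two hypothetical samples, not taken from the dataset.
samples = [(4.0, 4.0, 4.0), (1.0, 4.0, 4.0)]
per_sample = mean_geo_agg(samples)   # average of the two per-sample aggregates
of_means = geo_agg(2.5, 4.0, 4.0)    # same formula applied to the axis means
print(round(per_sample, 3), round(of_means, 3))
```

Under this assumed form, the two orders of aggregation disagree (3.0 versus about 3.16 for the toy samples), because a single low IF score pulls its sample down sharply. Applying the formula directly to the Kling o3 Omni column means gives about 3.17 rather than the listed 3.057, so the GeoAgg column is probably not computed from the column means; per-sample aggregation would be consistent with that gap, but this is an inference.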
Real editing prompts, real model outputs, real human scores. Each row shows the original video alongside edits from different models.
Video editing models can now follow complex instructions — zooming, removing objects, changing styles, adding elements. But how do we know which edits are actually good?
VEFX-Dataset provides 5,049 human-annotated video editing examples from 1,419 source videos across 9 categories and 32 subcategories, scored along three decoupled dimensions. We further release VEFX-Reward, a dedicated reward model, and VEFX-Bench, a benchmark of 300 curated pairs for standardized comparison.
Each edited video is scored on a 1–4 scale across three independent axes, capturing distinct aspects of quality.
Instruction Following (IF): Does the edit faithfully execute the given instruction? Measures semantic accuracy, that is, whether the intended change actually happened.
Rendering Quality (RQ): Is the output visually clean and artifact-free? Evaluates temporal consistency, spatial fidelity, and overall visual quality of the edited video.
Edit Exclusivity (EE): Are unrelated regions preserved? Captures whether the model changed only what was asked, without unintended side effects elsewhere.
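To make the decoupling concrete, here is a small sketch of how one scored edit could be represented. The `EditScore` type, its field names, and the example values are hypothetical; the fields are assumed to correspond, in order, to the three axes described above and to the IF, RQ, and EE columns of the leaderboard.

```python
from dataclasses import dataclass

AXES = ("if_score", "rq_score", "ee_score")


@dataclass(frozen=True)
class EditScore:
    """One edited video scored on three axes (hypothetical record type)."""

    if_score: float  # did the requested change actually happen?
    rq_score: float  # is the output clean and temporally consistent?
    ee_score: float  # were unrelated regions left alone?

    def __post_init__(self):
        # Every axis uses the same 1-4 scale.
        for name in AXES:
            value = getattr(self, name)
            if not 1.0 <= value <= 4.0:
                raise ValueError(f"{name}={value} is outside the 1-4 scale")

    def weakest_axis(self):
        """Return the name of the lowest-scoring axis (first one on ties)."""
        return min(AXES, key=lambda name: getattr(self, name))


# Hypothetical edit: the instruction was followed and the render is decent,
# but the background changed too, so only the third axis is penalized.
score = EditScore(if_score=4.0, rq_score=3.0, ee_score=1.0)
print(score.weakest_axis())
```

Keeping the axes as separate fields, rather than one overall number, is what lets a failure like over-editing show up on its own instead of being averaged away.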
Kling o3 Omni (3.057) and Kling o1 (2.985) top the leaderboard. UniVideo is the strongest open-source model (2.516), outperforming two commercial systems: Wan 2.6 (2.146) and Luma Ray 2 (1.804).
RQ is higher than IF for every model, by roughly 0.5 to 1.3 points, which suggests that producing visually plausible edits is easier than faithfully following editing instructions.
EE shows the widest spread (1.180–3.376), indicating that over-editing and unintended scene changes remain a major failure mode in some current systems, such as VACE (1.180) and Luma Ray 2 (1.363).
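The comparisons above can be checked directly against the leaderboard. The sketch below copies the IF, RQ, and EE columns from the table and recomputes the per-axis spread and the RQ-versus-IF gap; the variable names are illustrative only.

```python
# (IF, RQ, EE) per model, copied from the leaderboard table.
scores = {
    "Kling o3 Omni": (3.033, 3.588, 3.043),
    "Kling o1": (3.040, 3.534, 2.976),
    "Runway Gen-4.5": (2.817, 3.319, 2.923),
    "Seedance 2.0": (2.811, 3.421, 3.088),
    "Grok Imagine": (2.606, 3.346, 3.376),
    "Luma Ray 3": (2.702, 3.403, 2.705),
    "UniVideo": (2.294, 3.266, 3.091),
    "Wan 2.6": (2.012, 3.317, 2.446),
    "Luma Ray 2": (2.038, 2.532, 1.363),
    "VACE": (2.027, 3.172, 1.180),
}

# Spread (max minus min) of each axis across models.
spread = {}
for index, axis in enumerate(("IF", "RQ", "EE")):
    column = [row[index] for row in scores.values()]
    spread[axis] = round(max(column) - min(column), 3)

# RQ minus IF for each model.
rq_if_gap = {name: round(rq - if_, 3) for name, (if_, rq, _) in scores.items()}

widest_axis = max(spread, key=spread.get)
rq_always_higher = all(gap > 0 for gap in rq_if_gap.values())

print(spread)
print(widest_axis, rq_always_higher)
print(min(rq_if_gap.values()), max(rq_if_gap.values()))
```

The EE spread (2.196) is about twice that of IF (1.028) or RQ (1.056), and the RQ-minus-IF gap is positive for all ten models.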
@article{gao2026vefx,
title={VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects},
author={Gao, Xiangbo and Jiang, Sicong and Liu, Bangya and Chen, Xinghao and Yang, Minglai and Yang, Siyuan and Wu, Mingyang and Yu, Jiongze and Zheng, Qi and Wang, Haozhi and others},
journal={arXiv preprint arXiv:2604.16272},
year={2026}
}