Poster
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
Mingxiang Liao · Hannan Lu · Qixiang Ye · Wangmeng Zuo · Fang Wan · Tianyu Wang · Yuzhong Zhao · Jingdong Wang · Xinyu Zhang
Comprehensive and constructive evaluation protocols play an important role in developing sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness of video content and its faithfulness to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V generation models and to improve existing evaluation metrics. In practice, we define a set of dynamics scores corresponding to multiple temporal granularities, and a new benchmark of text prompts graded by multiple levels of dynamics. On this text prompt benchmark, we assess the generation capacity of T2V models, characterized by metrics of dynamics range and T2V alignment. Moreover, we analyze how existing metrics relate to the dynamics metrics and improve them from the perspective of dynamics. Experiments show that DEVIL evaluation metrics achieve up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models.
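To make the idea of dynamics scores at multiple temporal granularities concrete, the following is a minimal toy sketch and not the DEVIL metrics themselves, which the abstract does not specify. It assumes a video is a list of frames, each a flat list of grayscale intensities, and uses mean absolute frame difference at several frame strides as a stand-in proxy for dynamics. The function names, the strides, and the proxy itself are all hypothetical.

```python
from statistics import fmean


def frame_difference(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    return fmean(abs(x - y) for x, y in zip(a, b))


def dynamics_score(frames, stride=1):
    """Average difference between frames that are `stride` steps apart.

    Toy proxy only (an assumption, not the paper's definition): a small
    stride reflects frame-to-frame change, a large stride reflects change
    over longer segments of the video.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    pairs = list(zip(frames, frames[stride:]))
    if not pairs:
        # The video is too short for this stride; report no measured change.
        return 0.0
    return fmean(frame_difference(a, b) for a, b in pairs)


def multi_granularity_scores(frames, strides=(1, 2, 4)):
    """Map each stride (a hypothetical temporal granularity) to its score."""
    return {s: dynamics_score(frames, s) for s in strides}


# A static clip and a clip whose brightness rises by 1 per frame.
static_clip = [[0.5] * 4 for _ in range(5)]
ramp_clip = [[float(t)] * 4 for t in range(5)]

static_scores = multi_granularity_scores(static_clip)
ramp_scores = multi_granularity_scores(ramp_clip)
```

Under this proxy a static clip scores zero at every stride, while the brightening clip scores higher at larger strides. A real protocol would likely rely on richer signals such as optical flow or learned features.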