

Poster

On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection

Xiufeng Song · Xiaohong Liu · Xiao Guo · Jiache Zhang · Qirui Li · Lei Bai · Xiaoming Liu · Guangtao Zhai

Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

The growing volume of videos synthesized by diffusion models poses a serious threat to information security and authenticity, creating an increasing demand for generated-content detection. However, existing video-level detection algorithms focus mainly on deepfake manipulation and fail on content generated by diffusion models. To advance video forensics, we propose an innovative Multi-Modal Forgery Representation (MMFR) that discriminates fake videos from real ones. This representation leverages the generalizable multimodal feature space of Large Multimodal Models (LMMs) to gain strong perceptual and comprehension abilities. Building on this representation, we push the frontier of video forgery detection with a powerful Transformer-based detector that integrates multimodal feature spaces. Beyond the generalizable MMFR feature space, the detector introduces a novel In-and-Across Frame Attention that exploits spatial and temporal information as auxiliary features. In addition, we construct a high-quality dataset of videos generated by various diffusion-based algorithms. Evaluations on several benchmarks confirm the effectiveness of our detector on general content from unseen diffusion models.
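The abstract does not detail the In-and-Across Frame Attention, so the following is only a minimal PyTorch sketch of one plausible reading: tokens first attend spatially within each frame, then temporally across frames at the same spatial position. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of "In-and-Across Frame Attention" (not the authors' code).
# Assumed input: frame features of shape (B, T, N, D) -- batch, frames,
# tokens per frame, channels. In-frame attention mixes the N tokens of each
# frame; across-frame attention mixes the T time steps at each token position.
import torch
import torch.nn as nn


class InAndAcrossFrameAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.in_frame = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.across_frame = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # In-frame (spatial) attention: fold time into the batch dimension
        # so each frame's N tokens attend only to one another.
        s = x.reshape(B * T, N, D)
        s_norm = self.norm1(s)
        s = s + self.in_frame(s_norm, s_norm, s_norm, need_weights=False)[0]
        x = s.reshape(B, T, N, D)
        # Across-frame (temporal) attention: fold space into the batch
        # dimension so each spatial position attends across the T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_norm = self.norm2(t)
        t = t + self.across_frame(t_norm, t_norm, t_norm, need_weights=False)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)


if __name__ == "__main__":
    x = torch.randn(2, 8, 49, 256)  # 2 clips, 8 frames, 7x7 patch tokens
    print(InAndAcrossFrameAttention()(x).shape)  # torch.Size([2, 8, 49, 256])
```

Factoring attention into a spatial pass followed by a temporal pass is a common way to keep cost manageable for video: it attends over N + T tokens per position rather than the full N × T joint sequence.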
