ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection

With the rapid development of Deepfake synthesis technology, information security and personal privacy have been severely threatened in recent years. To achieve robust Deepfake detection, researchers attempt to exploit the joint spatial-temporal information in videos, e.g., using recurrent networks and 3D convolutional networks. However, these spatial-temporal models still leave room for improvement. Another general challenge is that it remains unclear what such spatial-temporal models actually learn. To address these two challenges, in this paper we propose an Interpretable Spatial-Temporal Video Transformer (ISTVT), which consists of a novel decomposed spatial-temporal self-attention and a self-subtract mechanism to capture spatial artifacts and temporal inconsistency for robust Deepfake detection. Thanks to this decomposition, ISTVT can be interpreted by visualizing the discriminative regions along both the spatial and temporal dimensions via a relevance propagation algorithm (relevance being the pixel-wise importance on the input). We conduct extensive experiments on large-scale datasets, including FaceForensics++, FaceShifter, DeeperForensics, Celeb-DF, and DFDC. Strong performance in both intra-dataset and cross-dataset Deepfake detection demonstrates the effectiveness and robustness of our method, and our visualization-based interpretability offers insights into the model.
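To make the two key ingredients of the abstract more concrete, the following is a minimal PyTorch sketch of what a decomposed spatial-temporal self-attention with a self-subtract step could look like. It is not the authors' implementation: the class name `DecomposedSTAttention`, the use of `nn.MultiheadAttention`, and the exact form of the self-subtract (differencing consecutive frame tokens) are assumptions for illustration only; the idea of factorizing attention into a spatial pass over patches and a temporal pass over frames follows the abstract's description.

```python
import torch
import torch.nn as nn


class DecomposedSTAttention(nn.Module):
    """Hypothetical sketch of decomposed spatial-temporal self-attention.

    Spatial attention runs over the patches of each frame independently,
    then temporal attention runs over the frames at each patch position.
    A self-subtract step (assumed here as differencing consecutive frames)
    emphasizes temporal inconsistency before attention is applied.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) patch-token features
        b, t, p, d = x.shape

        # Self-subtract (assumed form): difference between consecutive
        # frames; the first frame is kept unchanged.
        diff = torch.cat([x[:, :1], x[:, 1:] - x[:, :-1]], dim=1)

        # Spatial attention: attend over patches within each frame.
        s = diff.reshape(b * t, p, d)
        s, _ = self.spatial_attn(s, s, s)
        s = s.reshape(b, t, p, d)

        # Temporal attention: attend over frames at each patch position.
        q = s.permute(0, 2, 1, 3).reshape(b * p, t, d)
        q, _ = self.temporal_attn(q, q, q)
        return q.reshape(b, p, t, d).permute(0, 2, 1, 3)


# Example usage with hypothetical sizes: 8 frames, 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
out = DecomposedSTAttention(dim=768)(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```

The decomposition reduces the cost of full joint attention over all frame-patch pairs to two smaller attention passes, which is the usual motivation for factorized video transformers.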
