Lip Sync Matters: A Novel Multimodal Forgery Detector

Deepfake technology has advanced a lot, but it is a double-sided sword for the community. One can use it for beneficial purposes, such as restoring vintage content in old movies, or for nefarious purposes, such as creating fake footage to manipulate the public and distribute non-consensual pornography. A lot of work has been done to combat its improper use by detecting fake footage with good performance thanks to the availability of numerous public datasets and unimodal deep learning-based models. However, these methods are insufficient to detect multimodal manipulations, such as both visual and acoustic. This work proposes a novel lip-reading-based multi-modal Deepfake detection method called “Lip Sync Matters.” It targets high-level semantic features to exploit the mismatch between the lip sequence extracted from the video and the synthetic lip sequence generated from the audio by the Wav2lip model to detect forged videos. Experimental results show that the proposed method outperforms several existing unimodal, ensemble, and multimodal methods on the publicly available multimodal FakeAVCeleb dataset.

PDF Abstract

Results from the Paper

Task Dataset Model Metric Name Metric Value Global Rank Benchmark
DeepFake Detection FakeAVCeleb AV-Lip-Sync Model Accuracy (%) 0.94 # 1
DeepFake Detection FakeAVCeleb Multimodal Ensemble Model Accuracy (%) 0.89 # 2


No methods listed for this paper. Add relevant methods here