Representation Learning through Multimodal Attention and Time-Sync Comments for Affective Video Content Analysis

ACM MM22 2022 · Jicai Pan, Shangfei Wang, Lin Fang

Although temporal patterns inherent in visual and audio signals are crucial for affective video content analysis, they have not yet been thoroughly explored. In this paper, we propose a novel Temporal-Aware Multimodal (TAM) method to fully capture the temporal information. Specifically, we design a cross-temporal multimodal fusion module that applies attention-based fusion to different modalities within and across video segments. As a result, it fully captures the temporal relations between different modalities. Furthermore, a single emotion label provides too little supervision for learning a representation of each segment, which makes temporal pattern mining difficult. We leverage time-synchronized comments (TSCs) as auxiliary supervision, since these comments are easily accessible and contain rich emotional cues. Two TSC-based self-supervised tasks are designed: the first predicts the emotional words in a TSC from the video representation and the TSC's contextual semantics, and the second predicts the segment in which the TSC appears by calculating the correlation between the video representation and the TSC embedding. These self-supervised tasks are used to pre-train the cross-temporal multimodal fusion module on a large-scale video-TSC dataset, which is crawled from the web without labeling costs. The pre-training tasks prompt the fusion module to perform representation learning on segments that contain TSCs, thus capturing more temporal affective patterns. Experimental results on three benchmark datasets show that the proposed fusion module achieves state-of-the-art results in affective video content analysis. Ablation studies verify that after TSC-based pre-training, the fusion module learns the affective patterns of more segments and achieves better performance.
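The abstract gives no equations, so the following is only a rough sketch of two of the ideas it describes: attention-based fusion of visual and audio features across segments, and the second self-supervised task, which locates the segment a TSC belongs to. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the scaled dot-product attention, the residual-and-concatenate fusion, the dot-product "correlation", and the random toy features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, context):
    """Each query vector attends over every context vector (scaled dot product)."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        attended.append([sum(w * c[j] for w, c in zip(weights, context))
                         for j in range(len(context[0]))])
    return attended


def fuse_segments(visual, audio):
    """Hypothetical fusion: every visual segment attends to all audio segments
    and vice versa; the two residual streams are concatenated per segment."""
    v_ctx = cross_attention(visual, audio)
    a_ctx = cross_attention(audio, visual)
    fused = []
    for v, vc, a, ac in zip(visual, v_ctx, audio, a_ctx):
        fused.append([x + y for x, y in zip(v, vc)] +
                     [x + y for x, y in zip(a, ac)])
    return fused


def tsc_localization(fused, tsc_embedding, target_segment):
    """Score each segment against the TSC embedding (dot product is an assumed
    stand-in for the paper's correlation) and return the distribution over
    segments plus the cross-entropy loss for the true segment."""
    probs = softmax([dot(seg, tsc_embedding) for seg in fused])
    loss = -math.log(probs[target_segment])
    return probs, loss


# Toy data: 4 segments, 8-dimensional features per modality (made-up values).
random.seed(0)
T, D = 4, 8
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
audio = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
fused = fuse_segments(visual, audio)

# A TSC embedding in the fused space; pretend the comment appeared in segment 2.
tsc_embedding = [random.gauss(0, 1) for _ in range(2 * D)]
probs, loss = tsc_localization(fused, tsc_embedding, target_segment=2)
print(len(fused), len(fused[0]), round(sum(probs), 6))
```

In a setup like this, minimizing the loss over many video-TSC pairs would push the fused representation of the commented segment toward the comment's embedding, which is one plausible reading of how the TSCs supply per-segment supervision.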


Results from the Paper

Ranked #1 on Video Emotion Recognition on Ekman6 (using extra training data)

Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data
Video Emotion Recognition | Ekman6 | TAM w/o TSC | Accuracy | 60.64 | #2 |
Video Emotion Recognition | Ekman6 | TAM | Accuracy | 61.00 | #1 |

