Video Saliency Prediction
11 papers with code • 0 benchmarks • 1 dataset
Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient visible sound source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations of the input space lead to more powerful predictions even in the absence of more sophisticated saliency decoders, and (4) the Audio-Visual model improves on 53.54% of the frames predicted by the best Visual model (our baseline).
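As a sketch of how such audio-visual fusion can be wired up, the hypothetical PyTorch module below fuses a 3D-convolutional visual branch with a spectrogram-based audio branch before decoding a saliency map. The `AudioVisualSaliency` name, all layer sizes, and the late-fusion choice are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AudioVisualSaliency(nn.Module):
    """Minimal audio-visual fusion sketch (hypothetical layer sizes).

    Visual branch: 3D convolutions over a clip of frames.
    Audio branch: 2D convolutions over a log-mel spectrogram.
    Fusion: concatenate the two feature maps, decode to a saliency map.
    """
    def __init__(self):
        super().__init__()
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 28, 28)),  # collapse time, keep space
        )
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((28, 28)),  # align spatially with visual features
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),  # single-channel saliency logits
        )

    def forward(self, frames, spectrogram):
        # frames: (B, 3, T, H, W); spectrogram: (B, 1, mel_bins, time_steps)
        v = self.visual(frames).squeeze(2)          # (B, 32, 28, 28)
        a = self.audio(spectrogram)                 # (B, 32, 28, 28)
        fused = torch.cat([v, a], dim=1)            # (B, 64, 28, 28)
        return torch.sigmoid(self.decoder(fused))   # saliency map in [0, 1]
```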
This paper investigates modifying an existing neural network architecture for static saliency prediction using two types of recurrences that integrate information from the temporal domain.
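One common way to add such temporal recurrence to a static network is a convolutional GRU carried across frames. The sketch below shows the idea in PyTorch; the `ConvGRUCell` class and its sizes are illustrative assumptions, not the paper's specific recurrence designs:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: a sketch of adding temporal recurrence
    on top of per-frame saliency features (not the paper's exact design)."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        # x, h: (B, C, H, W) current frame features and previous hidden state
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Usage: run any static encoder per frame, carry the hidden state over time.
cell = ConvGRUCell(channels=64)
h = torch.zeros(1, 64, 28, 28)
for frame_feats in torch.randn(8, 1, 64, 28, 28):  # 8 frames of features
    h = cell(frame_feats, h)                        # temporal integration
```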
We further find from our database that human attention is temporally correlated, with a smooth saliency transition across video frames.
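Such smooth transitions can also be imposed explicitly at prediction time, for example with an exponential moving average over per-frame saliency maps. The snippet below is a minimal sketch of that idea; the `smooth_saliency` helper and the `alpha` smoothing factor are assumptions, not part of the cited work:

```python
import numpy as np

def smooth_saliency(maps, alpha=0.8):
    """Exponential moving average over a sequence of saliency maps.

    maps: iterable of (H, W) arrays, one per frame.
    alpha: assumed smoothing factor; higher values retain more history.
    """
    smoothed = [maps[0]]
    for m in maps[1:]:
        smoothed.append(alpha * smoothed[-1] + (1 - alpha) * m)
    return np.stack(smoothed)
```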
Our model starts with a rough segmentation and quantifies several intuitive observations such as the effects of visual discomfort level, depth abruptness, motion acceleration, elements of surprise, size and compactness of the salient regions, and emphasizing only a few salient objects in a scene.
Due to a variety of motions across different frames, it is highly challenging to learn an effective spatiotemporal representation for accurate video saliency prediction (VSP).
When the base hierarchical model is empowered with domain-specific modules, performance improves, outperforming state-of-the-art models on three out of five metrics on the DHF1K benchmark and reaching the second-best results on the other two.
3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction, which is explored in this paper.
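To illustrate the basic pattern, the toy model below (an assumption-laden sketch, not any published architecture) encodes a clip with 3D convolutions, averages the features over the temporal axis, and decodes a single saliency map:

```python
import torch
import torch.nn as nn

class Tiny3DSaliency(nn.Module):
    """Minimal 3D-convolutional saliency sketch (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, clip):
        # clip: (B, 3, T, H, W)
        feats = self.encoder(clip)      # (B, 32, T, H, W)
        pooled = feats.mean(dim=2)      # average over time -> (B, 32, H, W)
        return torch.sigmoid(self.decoder(pooled))  # clip-level saliency map
```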