Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube".
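An "action tube" can be thought of as a sequence of per-frame bounding boxes over a contiguous span of frames. A minimal sketch of that data structure, with an illustrative temporal-overlap score for matching detections to ground truth (the class name and methods here are hypothetical, not from any particular paper):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActionTube:
    # One box (x1, y1, x2, y2) per frame, starting at start_frame.
    label: str
    start_frame: int
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

    @property
    def end_frame(self) -> int:
        # The tube spans [start_frame, end_frame] inclusive.
        return self.start_frame + len(self.boxes) - 1

    def temporal_iou(self, other: "ActionTube") -> float:
        # Intersection-over-union of the two tubes' frame ranges,
        # a common first filter before comparing boxes spatially.
        inter = min(self.end_frame, other.end_frame) - max(self.start_frame, other.start_frame) + 1
        union = max(self.end_frame, other.end_frame) - min(self.start_frame, other.start_frame) + 1
        return max(inter, 0) / union
```

A full evaluation would additionally average spatial box IoU over the overlapping frames; the temporal part alone is shown to keep the sketch short.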
It is well recognized that modeling human-object or object-object relations is helpful for the detection task.
In this paper, we present a conceptually simple, general, and novel framework for few-shot temporal activity detection based on proposal regression, which detects the start and end times of activities in untrimmed videos.
We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis in the community.
Online temporal action localization from an untrimmed video stream is a challenging problem in computer vision.
We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize to active speakers in the visual frames.
Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection.
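The modules listed above form a pipeline from raw audio to per-segment speaker labels. A toy, self-contained sketch of the first four stages (VAD, segmentation, embedding extraction, clustering), where every function is a deliberately simplified stand-in rather than the system's actual components:

```python
import numpy as np

def voice_activity_detection(signal, frame_len=160, thresh=0.01):
    # Energy-based VAD stand-in: flag frames whose mean energy exceeds a threshold.
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return (frames ** 2).mean(axis=1) > thresh

def segment(speech_mask):
    # Group contiguous speech frames into (start, end) frame-index segments.
    segs, start = [], None
    for i, is_speech in enumerate(speech_mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(speech_mask)))
    return segs

def embed(signal, seg, frame_len=160):
    # Placeholder "speaker embedding": mean and std of the segment's samples
    # (a real system would use a trained neural embedding such as a d-vector).
    chunk = signal[seg[0] * frame_len : seg[1] * frame_len]
    return np.array([chunk.mean(), chunk.std()])

def cluster(embeddings, thresh=0.5):
    # Greedy clustering: assign each embedding to the nearest centroid,
    # or open a new cluster if none is within the distance threshold.
    labels, centroids = [], []
    for emb in embeddings:
        dists = [np.linalg.norm(emb - c) for c in centroids]
        if dists and min(dists) < thresh:
            labels.append(int(np.argmin(dists)))
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels
```

Similarity scoring, resegmentation, and overlap detection would follow the clustering step; they are omitted here to keep the sketch compact.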
Afterwards, using these features, deep multiple-instance learning, and the proposed ranking loss, our model learns to predict an abnormality score at the video-segment level.
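A multiple-instance ranking loss of this kind typically demands that the highest-scored segment of an anomalous video outrank the highest-scored segment of a normal video by a margin, often with smoothness and sparsity regularizers over the anomalous video's segment scores. A hedged sketch under those assumptions; the margin and the regularizer weights below are illustrative choices, not the paper's exact values:

```python
import numpy as np

def mil_ranking_loss(anom_scores, norm_scores, margin=1.0, lam1=8e-5, lam2=8e-5):
    # Hinge term: the top segment of the anomalous bag should beat the
    # top segment of the normal bag by at least `margin`.
    hinge = max(0.0, margin - anom_scores.max() + norm_scores.max())
    # Temporal smoothness: penalize abrupt score changes between adjacent segments.
    smooth = float(np.sum(np.diff(anom_scores) ** 2))
    # Sparsity: anomalies should occupy few segments, so keep total score low.
    sparse = float(np.sum(anom_scores))
    return hinge + lam1 * smooth + lam2 * sparse
```

Only the bag-level video label is needed to compute this loss, which is what makes the weakly supervised segment-level scoring possible.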