Inspired by recent advances in neural machine translation, that jointly align and translate using encoder-decoder networks equipped with attention, we propose an attentionbased LSTM model for human activity recognition.
We evaluate our models on large scale LSMDC16 movie dataset for two tasks: 1) Standard Ranking for video annotation and retrieval 2) Our proposed movie multiple-choice test.
Ranked #32 on Video Retrieval on MSR-VTT
In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions.
DVS is an audio narration describing the visual elements and actions in a movie for the visually impaired.
In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions.
This paper proposes a semantic segmentation method for outdoor scenes captured by a surveillance camera.