Memory-Attended Recurrent Network for Video Captioning

Typical techniques for video captioning follow the encoder-decoder framework, which can only focus on the one source video being processed. A potential disadvantage of such a design is that it cannot capture the multiple visual contexts of a word that appears in more than one relevant video in the training data...
