Global Object Proposals for Improving Multi-Sentence Video Descriptions

There has been significant progress in image captioning in recent years, but the generation of video descriptions is still in its early stages, owing to the complex nature of videos in comparison to images. Generating paragraph descriptions of a video is even more challenging; among the main issues are temporal object dependencies and complex object-object relationships. Recently, many works have been proposed for the generation of multi-sentence video descriptions. The majority of them follow a two-step approach: 1) event proposal and 2) caption generation. While these approaches produce good results, they miss out on globally available information. Here we propose the use of global object proposals while generating video captions. Experimental results on the ActivityNet dataset illustrate that global object proposals can produce more informative and correct captions. We also propose three scores to evaluate the object detection capacity of the generator. A qualitative comparison of captions generated by the proposed method and state-of-the-art techniques demonstrates the efficacy of the proposed method.
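The abstract does not spell out how object proposals are made "global", so the following is only a minimal illustrative sketch, not the paper's method: one simple way to obtain video-level object context is to aggregate per-frame detections across all frames, keeping each label's best confidence, and pass the top-ranked labels to the caption generator as additional context. The function name and data layout here are hypothetical.

```python
from collections import defaultdict

def global_object_proposals(frame_detections, top_k=5):
    """Aggregate per-frame detections into video-level ("global") object
    proposals: keep each label's highest confidence seen in any frame,
    then return the top_k labels ranked by that confidence.

    frame_detections: list (one entry per frame) of lists of
    (label, confidence) pairs. Hypothetical format for illustration.
    """
    best = defaultdict(float)
    for detections in frame_detections:
        for label, conf in detections:
            best[label] = max(best[label], conf)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]

# Toy example: detections from three frames of one video.
frames = [
    [("person", 0.9), ("dog", 0.4)],
    [("person", 0.8), ("frisbee", 0.7)],
    [("dog", 0.6)],
]
print(global_object_proposals(frames, top_k=2))  # ['person', 'frisbee']
```

Unlike per-event proposals, this aggregation sees objects from the entire video, so a caption for one event can still mention an object that is only clearly visible elsewhere in the clip.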

Benchmark results

Task: Dense Video Captioning
Dataset: ActivityNet Captions
Model: ADV-INF + Global

Metric   Value   Global Rank
METEOR   16.36   # 2
BLEU-4    9.45   # 1
CIDEr    19.40   # 6
DIV-1     0.60   # 1
DIV-2     0.78   # 1
RE-4      0.05   # 1
