Spatio-Temporal Attention Models for Grounded Video Captioning

17 Oct 2016 · Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts...

