Cross-Modal and Hierarchical Modeling of Video and Text

ECCV 2018 · Bowen Zhang, Hexiang Hu, Fei Sha

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action...
