Incorporating Background Knowledge into Video Description Generation

EMNLP 2018 · Spencer Whitehead, Heng Ji, Mohit Bansal, Shih-Fu Chang, Clare Voss

Most previous efforts toward video captioning focus on generating generic descriptions, such as "A man is talking." We collect a news video dataset to generate enriched descriptions that include important background knowledge, such as named entities and related events, which allows the user to fully understand the video content. We develop an approach that uses video metadata to retrieve topically related news documents for a video and extracts the events and named entities from these documents. Then, given the video as well as the extracted events and entities, we generate a description using a Knowledge-aware Video Description network. The model learns to incorporate entities found in the topically related documents into the description via an entity pointer network. The generation procedure is guided by the event and entity types from the topically related documents through a knowledge gate, a gating mechanism added to the model's decoder that takes a one-hot vector of these types. We evaluate our approach on the new dataset of news videos we have collected, establishing the first benchmark for this dataset and proposing a new metric to evaluate these descriptions.
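
The knowledge gate is described above only at a high level, so the following is a minimal sketch of one plausible form of such a gate, not the paper's exact formulation. It assumes a sigmoid gate computed from the decoder hidden state and a type indicator vector, applied elementwise to the hidden state. The type inventory, weight shapes, random initialization, and the names `knowledge_gate` and `encode_types` are all hypothetical and chosen for illustration.

```python
import math
import random

# Hypothetical inventory of event and entity types (an assumption, not from the paper).
TYPES = ["Person", "Organization", "Location", "Attack", "Meet"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode_types(present):
    """Indicator vector with 1.0 for each type found in the related documents."""
    return [1.0 if t in present else 0.0 for t in TYPES]


def knowledge_gate(hidden, type_vector, w_hidden, w_types, bias):
    """Assumed gate: g = sigmoid(W_h h + W_k k + b), output = g * h (elementwise)."""
    from_hidden = matvec(w_hidden, hidden)
    from_types = matvec(w_types, type_vector)
    gate = [sigmoid(a + b + c) for a, b, c in zip(from_hidden, from_types, bias)]
    return [g * h for g, h in zip(gate, hidden)]


# Toy decoder state and randomly initialized (untrained) parameters.
rng = random.Random(0)
hidden_size = 4
hidden = [rng.uniform(-1, 1) for _ in range(hidden_size)]
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(hidden_size)]
w_types = [[rng.uniform(-0.5, 0.5) for _ in TYPES] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

type_vector = encode_types({"Person", "Attack"})
gated = knowledge_gate(hidden, type_vector, w_hidden, w_types, bias)
```

Because each gate value lies strictly between 0 and 1, the gated state has the same size as the decoder state and never exceeds it in magnitude. In a trained model the weights would be learned so that the types present in the retrieved documents modulate which parts of the state influence the next word.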
