By combining these two attention mechanisms, a video SUMmarization model with Diversified Contextual Attention scheme is developed, namely SUM-DCA.
Second, we construct the plugable multi-modal scene retriever to retrieve scenes represented with pairs of an image and its stylized caption, which are similar to the query image or caption in the large-scale factual data.
Then the representation of each dialogue turn is aggregated by a hierarchical structure to form the passage information, which is utilized in the current turn of DST.
More specifically, we take advantage of a decision model to help the dialogue system decide whether to wait or answer.
And the arbitrator decides whether to wait or to make a response to the user directly.