Second, we construct a pluggable multi-modal scene retriever that retrieves scenes, each represented as an image paired with its stylized caption, that are similar to the query image or caption in the large-scale factual data.
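As a rough illustration of similarity-based scene retrieval, the sketch below ranks scenes by cosine similarity between a query embedding and precomputed scene embeddings. The embedding vectors, file names, captions, and the `retrieve_scenes` helper are hypothetical; the encoder and similarity measure actually used by the retriever are not specified here, so cosine similarity is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_scenes(query_vec, scenes, k=2):
    """Return the k scenes whose embeddings are most similar to the query embedding.

    Each scene is a dict holding an image reference, its stylized caption,
    and a joint embedding (all illustrative placeholders).
    """
    ranked = sorted(scenes, key=lambda s: cosine(query_vec, s["embedding"]), reverse=True)
    return ranked[:k]


# Toy scene pool; in practice the embeddings would come from a multi-modal encoder.
scenes = [
    {"image": "img_001.jpg", "caption": "a dog racing joyfully along the shore", "embedding": [0.9, 0.1, 0.0]},
    {"image": "img_002.jpg", "caption": "a quiet street under heavy rain", "embedding": [0.0, 1.0, 0.0]},
    {"image": "img_003.jpg", "caption": "a puppy splashing through the waves", "embedding": [0.8, 0.2, 0.1]},
]

top = retrieve_scenes([1.0, 0.0, 0.0], scenes, k=2)
```

Because the retriever only depends on a query embedding and a scene pool, it can be swapped in or out without touching the rest of the pipeline, which is one plausible reading of "pluggable".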
The representations of the dialogue turns are then aggregated through a hierarchical structure to form passage-level information, which is used in the current turn of dialogue state tracking (DST).
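A minimal sketch of such a two-level aggregation is given below, assuming mean pooling at both levels: token vectors are pooled into one vector per turn, and the turn vectors are pooled into a single passage vector. The pooling operator and the helper names `mean_pool` and `aggregate_dialogue` are assumptions for illustration; the actual model may use learned encoders or attention at either level.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def aggregate_dialogue(turns):
    """Hierarchically aggregate a dialogue history.

    turns: list of turns, each a list of token vectors.
    Returns (turn_vectors, passage_vector).
    """
    # Level 1: pool the token vectors of each turn into a turn representation.
    turn_vectors = [mean_pool(tokens) for tokens in turns]
    # Level 2: pool the turn representations into passage-level information.
    passage_vector = mean_pool(turn_vectors)
    return turn_vectors, passage_vector


# Toy history with two turns of 2-dimensional token vectors.
history = [
    [[1.0, 0.0], [3.0, 2.0]],  # turn 1: two tokens
    [[0.0, 4.0]],              # turn 2: one token
]
turn_vectors, passage_vector = aggregate_dialogue(history)
```

The passage vector would then be passed, together with the current turn, to the state tracker.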
More specifically, we use a decision model to help the dialogue system decide whether to wait or answer.
Building a high-quality multi-domain dialogue system is challenging because the dialogue state space is complicated and entangled across domains, which seriously limits the quality of the dialogue policy and, in turn, affects the generated responses.
The arbitrator decides whether to wait or to respond to the user directly.
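The wait-or-respond decision can be sketched as a thresholded score over the messages received so far. Everything below is a hypothetical stand-in: `toy_score` is a punctuation heuristic used only so the example runs, whereas the arbitrator described here would be a learned model, and the threshold value is an assumption.

```python
def arbitrate(score, threshold=0.5):
    """Map a completeness score in [0, 1] to an action: 'answer' or 'wait'."""
    return "answer" if score >= threshold else "wait"


def run_turns(messages, score_fn, threshold=0.5):
    """Feed user messages one at a time and record the arbitrator's action after each."""
    buffer, actions = [], []
    for msg in messages:
        buffer.append(msg)
        action = arbitrate(score_fn(buffer), threshold)
        actions.append(action)
        if action == "answer":
            buffer = []  # buffered messages are consumed by the response
    return actions


def toy_score(buffer):
    """Placeholder scorer: treat terminal punctuation as the end of the user's thought."""
    return 1.0 if buffer[-1].rstrip().endswith((".", "?", "!")) else 0.0


actions = run_turns(
    ["I want to book a table", "for two people", "tonight at 7."],
    toy_score,
)
```

In this toy run the system waits through the first two fragments and answers only after the third, once the request appears complete.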
Incorporating external knowledge into a neural dialogue model is critical for dialogue systems to behave like real humans.