Empathetic dialogue assembles emotion understanding, feeling projection, and appropriate response generation.
At the planning stage, we propose a PlanNet to generate the layout of the product and other visual components considering both the appearance features of the product and semantic features of the text, which improves the diversity and rationality of the layouts.
Accurate knowledge selection is critical in knowledge-grounded dialogue systems.
However, noun phrases (NPs) and relation phrases (RPs) in OKBs are not canonicalized and often appear in different paraphrased textual variants, which leads to redundant and ambiguous facts.
To this end, we contribute to advance the study of the proactive dialogue policy to a more natural and challenging setting, i. e., interacting dynamically with users.
Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step.
In this paper, we design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks.
Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.
Specially, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which avoids any requirement on the existence and quality of image captions.
In the question answering(QA) task, multi-hop reasoning framework has been extensively studied in recent years to perform more efficient and interpretable answer reasoning on the Knowledge Graph(KG).
In the field of dialogue summarization, due to the lack of training data, it is often difficult for supervised summary generation methods to learn vital information from dialogue context with limited data.
Learning and analyzing rap lyrics is a significant basis for many web applications, such as music recommendation, automatic music categorization, and music information retrieval, due to the abundant source of digital music in the World Wide Web.
The current methods for the link prediction taskhavetwonaturalproblems:1)the relation distributions in KGs are usually unbalanced, and 2) there are many unseen relations that occur in practical situations.
To the best of our knowledge, there lacks a general framework that approaches multi-hop reasoning in mixed long-short distance reasoning scenarios.
Summarization of multimedia data becomes increasingly significant as it is the basis for many real-world applications, such as question answering, Web search, and so forth.
End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously.
End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model.
Enabling a mechanism to understand a temporal story and predict its ending is an interesting issue that has attracted considerable attention, as in case of the ROC Story Cloze Task (SCT).
Learning social media content is the basis of many real-world applications, including information retrieval and recommendation systems, among others.
In this paper, we propose the first model to be able to generate visually grounded questions with diverse types for a single image.