In the coarse labeling stage, the joint model outputs a bracketed tree, in which each node corresponds to one of four labels (i. e., phrase, subphrase, word, subword).
Thanks to the strong representation learning capability of deep learning, especially pre-training techniques with language model loss, dependency parsing has achieved great performance boost in the in-domain scenario with abundant labeled training data for target domains.
The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a series of visual rewards by design.
Zero-Shot Learning (ZSL), which aims at automatically recognizing unseen objects, is a promising learning paradigm to understand new real-world knowledge for machines continuously.
To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i. e., MM-Diffusion), with two-coupled denoising autoencoders.
To deal with that problem, in this paper, we propose a novel Multi-modal Siamese Network for Entity Alignment (MSNEA) to align entities in different MMKGs, in which multi-modal knowledge could be comprehensively leveraged by the exploitation of inter-modal effect.
Ranked #7 on Multi-modal Entity Alignment on UMVM-oea-d-w-v1 (using extra training data)
Recent years have witnessed the improving performance of Chinese Named Entity Recognition (NER) from proposing new frameworks or incorporating word lexicons.
In this survey on MMKGs constructed by texts and images, we first give definitions of MMKGs, followed with the preliminaries on multi-modal tasks and techniques.
Most previous studies of document-level event extraction mainly focus on building argument chains in an autoregressive way, which achieves a certain success but is inefficient in both training and inference.
Ranked #3 on Document-level Event Extraction on ChFinAnn
Specifically, during neural network training, we naturally model the noise samples in each batch following a hypergeometric distribution parameterized by the noise-rate.
This paper studies how to automatically generate a natural language text that describes the facts in knowledge graph (KG).
Several previous works on syntactic parsing propose to annotate shallow word-internal structures for better utilizing character-level information.
For global coherence, we design a hierarchical self-attentive architecture with both subgraph- and node-level attention to enhance the correlations between subgraphs.
Entity linking (EL) for the rapidly growing short text (e. g. search queries and news titles) is critical to industrial applications.
Complex node interactions are common in knowledge graphs (KGs), and these interactions can be considered as contextualized knowledge exists in the topological structure of KGs.
First, based on graph capsules, we adaptively learn aspect capsules for inferring the aspect sequence.
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.
NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time which leads to a poor approximation of target distribution; 3) the feature representation ability of encoder can be weak due to lack of effective alignment mechanism; and 4) the removal of AR language model exacerbates the inherent ambiguity problem of lipreading.
Specifically, we erase name regularity, mention coverage and context diversity respectively from the benchmarks, in order to explore their impact on the generalization ability of models.
Complex node interactions are common in knowledge graphs, and these interactions also contain rich knowledge information.