We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e., the combination of modalities within a range of temporal offsets.
Entity alignment is the task of linking entities with the same real-world identity across different knowledge graphs (KGs); it has recently been dominated by embedding-based methods.
For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not).
SOTA for Extractive Document Summarization on CNN / Daily Mail (using extra training data)
Modeling and synthesizing image noise is an important aspect of many computer vision applications.
In this study, we propose a deep neural network based few-shot learning approach for rolling bearing fault diagnosis with limited data.
Although existing CNN-based temporal frameworks attempt to address the sensitivity and drift problems by concurrently processing all input frames in the sequence, the existing state-of-the-art CNN-based framework is limited to 3D pose estimation of a single frame from a sequential input.
With the guidance of such a map, we boost the performance of R101-Mask R-CNN on instance segmentation from 35.7 mAP to 37.9 mAP without modifying the backbone or network structure.
We construct novel JPEG, Fog, Gabor, and Snow adversarial attacks to simulate unforeseen adversaries and perform a careful study of adversarial robustness against these and existing distortion types.
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection.
SOTA for Semantic Segmentation on Cityscapes (using extra training data)