Further pre-training language models on in-domain data (domain-adaptive pre-training, DAPT) or task-relevant data (task-adaptive pre-training, TAPT) before fine-tuning has been shown to improve downstream task performance.
The datasets will be released to facilitate the development of video captioning metrics.
In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information.
This paper explores how the coherence of different modalities of 3D data (e.g., point cloud, image, and mesh) can be used to improve data efficiency for both 3D classification and retrieval tasks.
Keyphrases are phrases in a document that concisely summarize its core content, helping readers quickly grasp what the article is about.
In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fr\'echet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors, and to provide a tool for data visualization for Fr\'echet regression.
Our proposed novel self-supervised model learns two types of distinct features: modality-invariant features and modality-specific features.
Unlike the existing methods that use sparse LiDAR mainly in a manner of time-consuming iterative post-processing, our model fuses monocular image features and sparse LiDAR features to predict initial depth maps.
Ranked #1 on Depth Completion on KITTI
CDI builds global attention and interaction among different levels in a decoupled space, which also alleviates the problem of heavy computation.
The rapid growth in published clinical trials makes it difficult to maintain up-to-date systematic reviews, which require identifying all relevant trials.
Siamese tracking has achieved groundbreaking performance in recent years, where the essence is the efficient matching operator cross-correlation and its variants.
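The cross-correlation matching operator at the heart of Siamese tracking can be sketched in a few lines. The naive sliding-window loop and the tensor shapes below are illustrative assumptions for clarity, not any particular tracker's implementation:

```python
import numpy as np

def cross_correlation(search, kernel):
    """Naive cross-correlation matching operator used in Siamese tracking:
    slide the template (kernel) feature map over the search-region feature
    map and compute a similarity score at every spatial offset.
    search: (C, Hs, Ws), kernel: (C, Hk, Wk) with Hk <= Hs, Wk <= Ws.
    Returns a (Hs - Hk + 1, Ws - Wk + 1) response map."""
    _, hs, ws = search.shape
    _, hk, wk = kernel.shape
    out = np.zeros((hs - hk + 1, ws - wk + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product between the template and the current window.
            out[i, j] = np.sum(search[:, i:i + hk, j:j + wk] * kernel)
    return out
```

In practice trackers compute this as a batched convolution on learned deep features (often depthwise, per-channel), but the response-map semantics are the same: the peak of the map indicates the most likely target location.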
Accurate localization of autonomous vehicles is essential for autonomy and driving safety, especially in complex urban streets and search-and-rescue subterranean environments where high-accuracy GPS is unavailable.
Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition.
Ranked #1 on Skeleton Based Action Recognition on NTU RGB+D 120
However, due to the vast diversity of images, it is not optimal to use one model for all images, even different regions of one image.
In this paper, we present a system for automatic knowledge base construction from large-scale enterprise documents with minimal human intervention.
We propose a novel scene flow estimation approach to capture and infer 3D motions from point clouds.
In this paper, we show the existence of universal perturbations that can enable the targeted attack, e.g., forcing a tracker to follow the ground-truth trajectory with specified offsets, to be video-agnostic and free from inference in a network.
However, the most suitable positions for inferring different targets, i.e., the object category and boundaries, are generally different.
One-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years.
Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.
Specifically, a new generator architecture is proposed to simultaneously transfer color/texture styles and transform local facial shapes into anime-like counterparts based on the style of a reference anime-face, while preserving the global structure of the source photo-face.
Like other forward regression-based sufficient dimension reduction methods, our approach avoids the relatively stringent distributional requirements necessary for inverse regression alternatives.
Dimensionality Reduction Methodology
The model can naturally explain the repeatability of FRBs with a period ranging from a few days to several hundred days, but it generally requires that the eccentricity of the planet's orbit be large enough.
High Energy Astrophysical Phenomena
Fine-Grained Named Entity Typing (FG-NET) aims at classifying the entity mentions into a wide range of entity types (usually hundreds) depending upon the context.
We built an entity alignment model on top of XLM-RoBERTa to project the entities detected in the English part of the parallel data onto the target-language sentences; its accuracy surpasses that of all previous unsupervised models.
In the second stage, we derive another warping model to refine warping results in less important regions by eliminating serious distortions in shape, disparity and 3D structure.
In this paper, we dissect the reasoning process of the aforementioned two tasks.
We first build a look-up table (LUT) from the ground-truth mask in the starting frame, and then retrieve the LUT to obtain an attention map for spatial constraints.
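As a rough illustration of the LUT-then-retrieve idea, here is a minimal sketch that builds a color look-up table from the first-frame mask and retrieves it on a later frame to form a spatial attention map. The quantized-RGB keying and all function names are illustrative assumptions, not the paper's actual feature space:

```python
import numpy as np

def build_lut(frame, mask, bins=8):
    """Build a color LUT from the starting frame: for each quantized RGB
    bin, store the fraction of its pixels that fall inside the ground-truth
    mask. frame: (H, W, 3) uint8, mask: (H, W) bool."""
    q = (frame // (256 // bins)).reshape(-1, 3).astype(int)
    keys = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    fg = np.zeros(bins ** 3)
    total = np.zeros(bins ** 3)
    np.add.at(total, keys, 1)                       # pixel count per bin
    np.add.at(fg, keys, mask.reshape(-1).astype(float))  # foreground count
    return fg / np.maximum(total, 1)

def retrieve_attention(frame, lut, bins=8):
    """Retrieve the LUT on a later frame to get a per-pixel attention map:
    each pixel's attention is the foreground probability of its color bin."""
    q = (frame // (256 // bins)).reshape(-1, 3).astype(int)
    keys = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    return lut[keys].reshape(frame.shape[:2])
```

The retrieved map is high where pixel appearance matches the first-frame foreground, which is exactly the kind of spatial constraint the sentence above describes.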
In this paper, we propose a novel object-aware anchor-free network to address this issue.
Ranked #2 on Visual Object Tracking on VOT2019
Fine-Grained Named Entity Typing (FG-NET) is a key component in Natural Language Processing (NLP).
In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy.
Ranked #1 on Video Captioning on MSR-VTT (using extra training data)
To tackle Named Entity Recognition (NER) tasks, supervised methods need sufficient cleanly annotated data, which is labor-intensive and time-consuming to obtain.
Multi-modal information is essential to describe what has happened in a video.
The conventional von Neumann architecture has been revealed as a major performance and energy bottleneck for rising data-intensive applications.
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning.
Furthermore, since different layers in a deep network capture feature maps at different scales, we use these feature maps to construct a spatial pyramid and then exploit multi-scale information to obtain more accurate attention scores. These scores are used to weight the local features at all spatial positions of the feature maps when computing the attention maps.
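A minimal sketch of attention over a spatial feature pyramid, assuming a single query vector and a per-level softmax over spatial positions (both assumptions for illustration; the actual scoring network is not specified above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multiscale_attention(feature_pyramid, query):
    """For each pyramid level (C, H, W), score every spatial position
    against the query vector (C,), normalize the scores with a softmax,
    and return the attention-weighted feature per level."""
    attended = []
    for feats in feature_pyramid:
        c = feats.shape[0]
        flat = feats.reshape(c, -1)        # (C, H*W) local features
        scores = softmax(query @ flat)     # (H*W,) attention scores
        attended.append(flat @ scores)     # (C,) weighted combination
    return attended
```

Because the scores sum to one at every level, each attended vector is a convex combination of that level's local features; a model would typically concatenate or fuse the per-level outputs afterwards.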
Compared with traditional video retargeting, stereo video retargeting poses new challenges because stereo video contains the depth information of salient objects and its temporal dynamics.
We present a web-based interface that automatically assesses reading difficulty of Chinese texts.
In dynamic object detection, it is challenging to construct an effective model to sufficiently characterize the spatial-temporal properties of the background.
We propose a novel approach to sufficient dimension reduction in regression, based on estimating contour directions of negligible variation for the response surface.
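To make the contour-direction idea concrete, here is a minimal numpy sketch in the spirit of simple contour regression: differences between observations with nearly equal responses lie along contours of the response surface, so the eigenvector with the smallest eigenvalue of their averaged outer products points along the direction where the response actually varies. The function name and threshold `c` are illustrative, not the paper's estimator:

```python
import numpy as np

def contour_direction(X, y, c=0.1):
    """Estimate a single direction of response variation. For pairs (i, j)
    with |y_i - y_j| < c, the difference X_i - X_j is approximately a
    contour direction of negligible variation; averaging the outer
    products and taking the SMALLEST-eigenvalue eigenvector recovers the
    direction along which y changes."""
    n, p = X.shape
    M = np.zeros((p, p))
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if abs(y[i] - y[j]) < c:
                d = X[i] - X[j]
                M += np.outer(d, d)
                count += 1
    M /= max(count, 1)
    vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return vecs[:, 0]               # smallest-eigenvalue eigenvector
```

On data with y = x1 and independent Gaussian predictors, the estimated direction aligns closely with the first coordinate axis, since contour differences are then nearly orthogonal to it.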
We first establish a law of large numbers and a convergence theorem in distribution to show the rate of convergence of the non-local means filter for removing Gaussian noise.
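For readers unfamiliar with the filter being analyzed, a minimal non-local means sketch follows: each pixel is replaced by a weighted average of pixels in a search window, with weights decaying in the squared distance between the surrounding patches. Parameter names (`h`, `patch`, `window`) are illustrative assumptions, not notation from the paper:

```python
import numpy as np

def nlm_denoise(img, h=0.1, patch=1, window=3):
    """Naive non-local means for a 2D float image. h controls the weight
    decay; patch and window are half-widths of the comparison patch and
    the search window, respectively."""
    H, W = img.shape
    pad = patch
    padded = np.pad(img, pad, mode='reflect')
    out = np.zeros_like(img)
    for i in range(H):
        for j in range(W):
            p = padded[i:i + 2 * pad + 1, j:j + 2 * pad + 1]  # patch at (i, j)
            num, den = 0.0, 0.0
            for di in range(max(0, i - window), min(H, i + window + 1)):
                for dj in range(max(0, j - window), min(W, j + window + 1)):
                    q = padded[di:di + 2 * pad + 1, dj:dj + 2 * pad + 1]
                    # Weight decays with squared patch distance.
                    w = np.exp(-np.sum((p - q) ** 2) / (h * h))
                    num += w * img[di, dj]
                    den += w
            out[i, j] = num / den
    return out
```

A constant image passes through unchanged, while additive Gaussian noise is averaged down across similar patches, which is the behavior whose convergence rate the sentence above studies.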
In this paper, we propose a novel bilayer sparse coding model for illumination estimation that considers image similarity in terms of both low level color distribution and high level image scene content simultaneously.