The datasets will be released to facilitate the development of video captioning metrics.
In this paper, we propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video by leveraging visual, audio and face information.
CDI builds global attention and interaction among different levels in a decoupled space, which also alleviates the heavy computational cost.
It can work in a purely data-driven manner and thus is capable of auto-creating a group of suitable convolutions for geometric shape modeling.
Siamese tracking has achieved groundbreaking performance in recent years; at its core is an efficient matching operator, cross-correlation, together with its variants.
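For readers unfamiliar with the operator, a naive cross-correlation between a template feature and a search-region feature can be sketched as follows (a minimal NumPy illustration, not any specific tracker's implementation; all names are ours):

```python
import numpy as np

def cross_correlation(search, template):
    """Slide the template feature over the search feature map and take
    the inner product at every offset; the peak of the resulting
    response map indicates the most likely target location."""
    _, hs, ws = search.shape
    _, ht, wt = template.shape
    response = np.zeros((hs - ht + 1, ws - wt + 1))
    for y in range(response.shape[0]):
        for x in range(response.shape[1]):
            response[y, x] = np.sum(search[:, y:y + ht, x:x + wt] * template)
    return response

rng = np.random.default_rng(0)
template = rng.standard_normal((8, 4, 4))
search = rng.standard_normal((8, 16, 16))
search[:, 5:9, 6:10] = template  # plant the target at a known offset
resp = cross_correlation(search, template)
peak = np.unravel_index(np.argmax(resp), resp.shape)
print(peak)
```

In practice trackers compute this in a single convolution call on deep features, but the sliding inner product above is the operation being performed.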
Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition.
Ranked #1 on Skeleton Based Action Recognition on NTU RGB+D 120
In order to prove that methods from different periods in the field of image classification perform differently on GasHisSDB, we select a variety of classifiers for evaluation.
In this paper, we first briefly review the development of Convolutional Neural Networks and Visual Transformers in deep learning, and introduce the sources and development of conventional noise and adversarial attacks.
The results of the study indicate that deep learning models are robust to changes in the aspect ratio of cells in cervical cytopathological images.
In this paper, we show the existence of universal perturbations that can enable a targeted attack, e.g., forcing a tracker to follow the ground-truth trajectory with specified offsets, while being video-agnostic and free from network inference.
Finally, a comparative study is performed to test the generalizability with both H&E and immunohistochemical stained images on a lymphoma image dataset and a breast cancer dataset, producing comparable F1-scores (85.6% and 82.8%) and accuracies (83.9% and 89.4%), respectively.
However, the most suitable positions for inferring different targets, i.e., the object category and boundaries, are generally different.
The one-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years.
Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.
The central core of the AnEn technique is a similarity metric that sorts historical forecasts with respect to a new target prediction.
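The idea can be illustrated with a minimal sketch (using Euclidean distance as the similarity metric is our simplifying assumption; the actual AnEn metric weights multiple predictor variables over a time window):

```python
import numpy as np

def analog_ensemble(target, hist_forecasts, hist_obs, k=2):
    """Sort historical forecasts by similarity to the new target
    prediction and return the observations paired with the k best
    analogs as the ensemble members."""
    dists = np.linalg.norm(hist_forecasts - target, axis=1)
    return hist_obs[np.argsort(dists)[:k]]

hist_f = np.array([[10.0, 0.2], [15.0, 0.5], [11.0, 0.3], [25.0, 0.9]])
hist_o = np.array([9.5, 14.0, 10.8, 24.1])   # verifying observations
members = analog_ensemble(np.array([10.4, 0.25]), hist_f, hist_o)
print(members)
```

The spread of the returned members is what provides the probabilistic information: similar past forecasts that verified differently signal forecast uncertainty.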
Meanwhile, by using IoT observations, the spatial resolution of air temperature predictions is significantly improved.
We first build a look-up-table (LUT) with the ground-truth mask in the starting frame, and then retrieve the LUT to obtain an attention map for spatial constraints.
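A minimal grayscale version of such a look-up-table can be sketched as follows (the method operates on richer features; this simplification and its names are ours):

```python
import numpy as np

def build_lut(frame, mask, bins=8):
    """Histogram first-frame pixels into intensity bins and store the
    foreground probability of each bin as the look-up table."""
    idx = (frame * bins // 256).astype(int)
    fg = np.bincount(idx[mask], minlength=bins)
    bg = np.bincount(idx[~mask], minlength=bins)
    return fg / np.maximum(fg + bg, 1)

def attention_map(frame, lut, bins=8):
    """Retrieve the LUT for a later frame to get per-pixel attention."""
    return lut[(frame * bins // 256).astype(int)]

frame0 = np.array([[200, 200], [10, 10]])          # bright foreground
mask0 = np.array([[True, True], [False, False]])   # ground-truth mask
lut = build_lut(frame0, mask0)
attn = attention_map(np.array([[10, 200], [200, 10]]), lut)
print(attn)
```

Because the table is indexed rather than recomputed, the retrieval per frame is a constant-time lookup per pixel, which is what makes it attractive as a spatial constraint.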
In this paper, we propose a novel object-aware anchor-free network to address this issue.
Ranked #2 on Visual Object Tracking on VOT2019
In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy.
Ranked #1 on Video Captioning on MSR-VTT (using extra training data)
To reciprocate these two tasks, we design a two-stream structure to learn features on both the object level (i.e., bounding boxes) and the pixel level (i.e., instance masks) jointly.
Ranked #53 on Instance Segmentation on COCO test-dev
Unsupervised video object segmentation has often been tackled by methods based on recurrent neural networks and optical flow.
Ranked #4 on Unsupervised Video Object Segmentation on DAVIS 2016 (using extra training data)
Multi-modal information is essential to describe what has happened in a video.
In order to provide a meaningful probabilistic forecast, the AnEn method requires storing a historical set of past predictions and observations in memory for a period of at least several months and spanning the seasons relevant for the prediction of interest.
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning.
In this paper, we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real time, with a single simple approach.
Ranked #3 on Visual Object Tracking on YouTube-VOS
Correlation filter-based trackers rely on a periodic assumption of the search sample to efficiently distinguish the target from the background.
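Concretely, the periodic assumption means the filter is trained over all cyclic shifts of the sample, which collapses ridge regression to element-wise division in the Fourier domain; a minimal MOSSE-style sketch (our illustration, not any specific tracker's code):

```python
import numpy as np

def train_filter(x, y, lam=1e-2):
    """Ridge regression over all cyclic shifts of sample x has the
    closed-form Fourier-domain solution
    H* = Y . conj(X) / (X . conj(X) + lambda)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return Y * np.conj(X) / (X * np.conj(X) + lam)

def respond(h_star, z):
    """Correlate the learned filter with a new search patch z."""
    return np.real(np.fft.ifft2(h_star * np.fft.fft2(z)))

yy, xx = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
label = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 8.0)  # Gaussian target
sample = np.random.default_rng(1).standard_normal((32, 32))
resp = respond(train_filter(sample, label), sample)
peak = np.unravel_index(np.argmax(resp), resp.shape)
print(peak)
```

The efficiency comes entirely from this trick: training and detection are O(n log n) FFTs instead of explicit convolutions, at the cost of boundary artifacts introduced by the implied periodicity.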
During the off-line training phase, an effective sampling strategy is introduced to control this distribution and make the model focus on the semantic distractors.
Ranked #9 on Visual Object Tracking on VOT2017/18
Furthermore, since different layers in a deep network capture feature maps of different scales, we use these feature maps to construct a spatial pyramid and then utilize multi-scale information to obtain more accurate attention scores, which are used to weight the local features in all spatial positions of feature maps to calculate attention maps.
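A schematic of this multi-scale weighting (a NumPy sketch under our own simplifications: channel-mean scores per layer, nearest-neighbour upsampling, and a single softmax normalization):

```python
import numpy as np

def softmax2d(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def multiscale_attention(feature_maps):
    """Fuse per-scale attention scores from layers of different
    resolution and weight the finest feature map position-wise.
    feature_maps: list of (C, H_i, W_i) arrays, finest first."""
    th, tw = feature_maps[0].shape[1:]
    scores = []
    for f in feature_maps:
        s = f.mean(axis=0)                            # score per position
        ry, rx = th // s.shape[0], tw // s.shape[1]
        scores.append(np.kron(s, np.ones((ry, rx))))  # upsample to finest
    attn = softmax2d(np.mean(scores, axis=0))         # fused attention map
    return feature_maps[0] * attn                     # weight local features

rng = np.random.default_rng(0)
out = multiscale_attention([rng.standard_normal((4, 8, 8)),
                            rng.standard_normal((4, 4, 4))])
print(out.shape)
```

The design point is that coarse layers contribute context while fine layers contribute localization, so fusing scores before normalization lets each scale veto or reinforce the others.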
First, a novel cost-sensitive multi-task loss function is designed to learn transferable aging features by training on the source population.
The RASNet model reformulates the correlation filter within a Siamese tracking framework, and introduces different kinds of attention mechanisms to adapt the model without updating it online.
Ranked #3 on Visual Object Tracking on OTB-2013
In dynamic object detection, it is challenging to construct an effective model to sufficiently characterize the spatial-temporal properties of the background.
In this work, we present an end-to-end lightweight network architecture, namely DCFNet, to learn the convolutional features and perform the correlation tracking process simultaneously.
To address this, we present a local subspace collaborative tracking method for robust visual tracking, in which multiple linear and nonlinear subspaces are learned to better model the nonlinear relationship of object appearances.
In this paper, a multi-feature max-margin hierarchical Bayesian model (M3HBM) is proposed for action recognition.
During each training stage, the SRD model learns a relational dictionary to capture consistent relationships between face appearance and shape, which are respectively modeled by the pose-indexed image features and the shape displacements for current estimated landmarks.
In this paper, we model interactions between neighbor targets by pair-wise motion context, and further encode such context into the global association optimization.
In this paper, we propose a novel bilayer sparse coding model for illumination estimation that considers image similarity in terms of both low level color distribution and high level image scene content simultaneously.
In this paper, we formulate human action recognition as a novel Multi-Task Sparse Learning (MTSL) framework which aims to construct a test sample with multiple features from as few bases as possible.
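The "as few bases as possible" objective is in the spirit of greedy sparse coding; a minimal matching-pursuit sketch (our illustration under the assumption of unit-norm dictionary atoms, not the paper's exact solver):

```python
import numpy as np

def matching_pursuit(D, x, n_atoms=2):
    """Greedily pick the dictionary bases (unit-norm columns of D) that
    best explain the residual, reconstructing x from few bases; the
    selected atoms' class labels can then vote for the action class."""
    residual = x.astype(float).copy()
    coef = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual
        j = int(np.argmax(np.abs(corr)))     # best-matching atom
        coef[j] += corr[j]
        residual = x - D @ coef              # explain what remains
    return coef

D = np.eye(4)                                # toy orthonormal dictionary
coef = matching_pursuit(D, np.array([2.0, 0.0, 1.0, 0.0]))
print(coef)
```

A multi-task variant would run such a pursuit per feature type while encouraging the tasks to agree on which bases are selected.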
In this paper, we propose a new global feature to capture the detailed geometrical distribution of interest points.