Image harmonization aims to make the foreground of a composite image compatible with its background, generating a more realistic overall appearance.
The major reason is that the positive pairs, i.e., different clips sampled from the same video, have a limited temporal receptive field and usually share a similar background while differing in motion.
Specifically, graph nodes representing instance features are used for detection and segmentation, while graph edges representing instance relations are used for tracking.
Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals.
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks.
With the emergence of super-high-resolution (e.g., gigapixel-level) images, performing efficient object detection on such images has become an important issue.
To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision.
We conduct comprehensive comparisons and detailed analyses on the challenging DAVIS16, DAVIS17, and YouTube-VOS benchmarks, demonstrating that the cyclic mechanism enhances segmentation quality and improves the robustness of VOS systems; we further provide qualitative comparisons and interpretations of how different VOS algorithms work.
We demonstrate that both temporal grains are beneficial to atomic action recognition.
Accurate prediction of the future given the past based on time series data is of paramount importance, since it opens the door for decision making and risk management ahead of time.
The crux of self-supervised video representation learning is to build general features from unlabeled videos.
The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing action-irrelevant features (e.g., background).
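As a rough illustration of how such a warp could be realized, the sketch below resamples a clip-level feature sequence onto a predicted action interval via linear interpolation. The (center, scale) parameterization and the use of `grid_sample` are illustrative assumptions, not the exact formulation of the method.

```python
import torch
import torch.nn.functional as F

def temporal_affine_warp(features, center, scale, out_len=32):
    """Warp a temporal feature sequence onto a predicted action interval.

    features: (B, C, T) clip-level features
    center, scale: (B,) normalized center in [-1, 1] and span in (0, 1],
        assumed to be predicted by a small localization head
    """
    B, C, T = features.shape
    # Target sampling positions: out_len points spread over the action span.
    base = torch.linspace(-1.0, 1.0, out_len, device=features.device)   # (out_len,)
    grid_t = center[:, None] + scale[:, None] * base[None, :]           # (B, out_len)
    # grid_sample expects a 2D grid; treat time as width and add a dummy height dim.
    grid = torch.stack([grid_t, torch.zeros_like(grid_t)], dim=-1)      # (B, out_len, 2)
    grid = grid.unsqueeze(1)                                            # (B, 1, out_len, 2)
    warped = F.grid_sample(features.unsqueeze(2), grid, align_corners=True)
    return warped.squeeze(2)                                            # (B, C, out_len)

# Example: resample a 64-step clip onto a predicted action window.
feats = torch.randn(2, 256, 64)
center = torch.tensor([0.1, -0.3])
scale = torch.tensor([0.5, 0.7])
print(temporal_affine_warp(feats, center, scale).shape)  # torch.Size([2, 256, 32])
```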
In addition, we add a localization branch to predict the localization accuracy, so that it can replace the regression-assistance link during inference.
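A minimal sketch of such a localization-accuracy branch is given below, assuming an IoU-style target in [0, 1] and a simple multiplicative fusion with the classification score at inference; the head architecture and the fusion rule are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LocalizationQualityHead(nn.Module):
    """Hypothetical branch that predicts localization accuracy (e.g., IoU with
    the ground-truth box) for each proposal; at inference its output
    re-weights the classification score."""
    def __init__(self, in_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1), nn.Sigmoid(),   # predicted IoU in [0, 1]
        )

    def forward(self, roi_feats):
        return self.fc(roi_feats).squeeze(-1)

def rescore(cls_scores, pred_iou):
    # Final detection score: classification confidence modulated by the
    # predicted localization accuracy (one common fusion choice).
    return cls_scores * pred_iou

head = LocalizationQualityHead()
roi_feats = torch.randn(100, 256)
cls_scores = torch.rand(100)
print(rescore(cls_scores, head(roi_feats)).shape)  # torch.Size([100])
```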
Pedestrian detection in a crowd is a challenging task due to a high number of mutually-occluding human instances, which brings ambiguity and optimization difficulties to the current IoU-based ground truth assignment procedure in classical object detection methods.
In this paper, inspired by curriculum learning, we analyze the barrel distortion rectification task in a progressive and meaningful manner.
In this paper, we address several inadequacies of current video object segmentation pipelines.
First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.
The task of spatial-temporal action detection has attracted increasing attention among researchers.
Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization.
This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem.
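For reference, the snippet below computes the AP-loss (one minus average precision) over ranked anchor scores. AP itself is non-differentiable, which is why an error-driven update scheme rather than ordinary gradients is needed to optimize it; the tensor layout here is an assumption for illustration.

```python
import torch

def ap_loss(scores, labels):
    """AP-loss over ranked anchor scores.

    scores: (N,) predicted foreground scores
    labels: (N,) 1 for positive anchors, 0 for negative anchors
    """
    order = torch.argsort(scores, descending=True)
    labels = labels[order].float()
    cum_pos = torch.cumsum(labels, dim=0)                      # positives seen so far
    ranks = torch.arange(1, labels.numel() + 1,
                         dtype=torch.float32, device=scores.device)
    precision = cum_pos / ranks                                # precision at each rank
    ap = (precision * labels).sum() / labels.sum().clamp(min=1.0)
    return 1.0 - ap                                            # AP-loss = 1 - AP

scores = torch.tensor([0.9, 0.8, 0.3, 0.75, 0.1])
labels = torch.tensor([1, 0, 0, 1, 0])
print(ap_loss(scores, labels))   # ~0.167
```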
The experimental results show that PIoU loss can dramatically improve the performance of OBB detectors, particularly on objects with high aspect ratios and complex backgrounds.
Visually localizing multiple sound sources in unconstrained videos is a formidable problem, especially in the absence of pairwise sound-object annotations.
We demonstrate that the proposed method is able to boost the performance of existing pose estimation pipelines on our HiEve dataset.
The TRP-trained network inherently has a low-rank structure and can be approximated with negligible performance loss, thus eliminating the need for fine-tuning after low-rank decomposition.
In this paper, we propose a novel paradigm of Spatial-Temporal Transformer Networks (STTNs) that leverages dynamic directed spatial dependencies and long-range temporal dependencies to improve the accuracy of long-term traffic forecasting.
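A minimal sketch of a spatial-then-temporal attention block over traffic-sensor features is shown below; the layer sizes, normalization placement, and use of standard multi-head attention are illustrative assumptions rather than the exact STTN design.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative block: attention across graph nodes, then across time."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, nodes, dim) traffic features on a road graph
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                               # attend across nodes
        s = self.norm1(s + self.spatial_attn(s, s, s)[0])
        s = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        s = self.norm2(s + self.temporal_attn(s, s, s)[0])       # attend across time
        return s.reshape(b, n, t, d).permute(0, 2, 1, 3)

x = torch.randn(2, 12, 207, 64)   # 12 past steps, 207 sensors (illustrative sizes)
print(SpatialTemporalBlock()(x).shape)  # torch.Size([2, 12, 207, 64])
```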
To accelerate DNNs inference, low-rank approximation has been widely adopted because of its solid theoretical rationale and efficient implementations.
Micro-expressions are spontaneous, brief and subtle facial muscle movements that expose underlying emotions.
This paper tries to fill the gap by introducing a novel large-scale dataset, the Amur Tiger Re-identification in the Wild (ATRW) dataset.
The task of re-identifying groups of people under different camera views is an important yet less-studied problem. Group re-identification (Re-ID) is a very challenging task since it is not only adversely affected by common issues in traditional single-object Re-ID problems, such as viewpoint and human-pose variations, but it also suffers from changes in group layout and group membership.
For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme of perceptron learning with the backpropagation algorithm of deep networks.
Network quantization is an effective method for the deployment of neural networks on memory- and energy-constrained mobile devices.
We propose Trained Rank Pruning (TRP), which alternates between low-rank approximation and training.
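The sketch below illustrates this alternation: ordinary optimizer updates interleaved with a periodic projection of weight matrices onto a low-rank approximation via truncated SVD. The energy-based rank selection and the restriction to fully connected layers are simplifying assumptions; the original method additionally employs nuclear-norm regularization.

```python
import torch

def truncate_rank(weight, energy=0.95):
    """Project a weight matrix onto a low-rank approximation by keeping the
    smallest number of singular values covering `energy` of the spectrum
    (the threshold choice is an assumption)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    cum = torch.cumsum(S, dim=0) / S.sum()
    r = int((cum < energy).sum().item()) + 1          # smallest rank reaching the energy
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

def trp_step(model, optimizer, loss_fn, batch, period, step):
    """One training step; every `period` steps, push weights back to low rank."""
    x, y = batch
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % period == 0:
        with torch.no_grad():
            for m in model.modules():
                if isinstance(m, torch.nn.Linear):
                    m.weight.copy_(truncate_rank(m.weight))
    return loss.item()
```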
Depthwise separable convolution has shown great efficiency in network design, but requires a time-consuming training procedure with the full training set available.
Based on the deeply supervised object detection (DSOD) framework, we propose Tiny-DSOD, dedicated to resource-restricted usage scenarios.
Facial micro-expression (ME) recognition has posed a huge challenge to researchers due to its subtle motion and the limited available databases.
In this paper, we propose a partition-masked Convolutional Neural Network (CNN) to achieve compressed-video enhancement for the state-of-the-art coding standard, High Efficiency Video Coding (HEVC).
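As one illustration of how partition information might be injected, the sketch below concatenates a decoded luma patch with a partition-boundary mask and predicts an enhancement residual; the residual-CNN layout and the concatenation-based masking are assumptions, since the actual partition-masked convolution may use the mask differently.

```python
import torch
import torch.nn as nn

class PartitionMaskedCNN(nn.Module):
    """Illustrative enhancement network: decoded frame + partition mask in,
    residual correction out (layer sizes are assumptions)."""
    def __init__(self, channels=64, blocks=8):
        super().__init__()
        layers = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(blocks):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, decoded_luma, partition_mask):
        x = torch.cat([decoded_luma, partition_mask], dim=1)
        return decoded_luma + self.body(x)   # predict a residual correction

frame = torch.rand(1, 1, 128, 128)           # decoded luma patch
mask = torch.zeros(1, 1, 128, 128)           # 1 at CU/TU boundaries from the bitstream
print(PartitionMaskedCNN()(frame, mask).shape)  # torch.Size([1, 1, 128, 128])
```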
In this paper, we propose two novel network quantization approaches: single-level network quantization (SLQ) for high-bit quantization and multi-level network quantization (MLQ) for extremely low-bit quantization (ternary). We are the first to consider network quantization from both the width and depth levels.
This paper addresses the problem of unsupervised domain adaptation on the task of pedestrian detection in crowded scenes.
In this paper, we propose a novel deep-based framework for action recognition, which improves the recognition accuracy by: 1) deriving more precise features for representing actions, and 2) reducing the asynchrony between different information streams.
Object detection is an important yet challenging task in video understanding and analysis, where one major challenge lies in the proper balance between two conflicting factors: detection accuracy and detection speed.
Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1$\%$ top-5 accuracy drop.
Part-based representation has been proven to be effective for a variety of visual applications.
We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair.
In this paper, we propose a new framework for segmenting feature-based moving objects under the affine subspace model.
Our approach first leverages the complete information from given trajectories to construct a thermal transfer field which provides a context-rich way to describe the global motion pattern in a scene.
Recognizing fine-grained sub-categories such as birds and dogs is extremely challenging due to the highly localized and subtle differences in some specific parts.
By adding a nonlinear post-processing step behind anisotropic filter banks, we demonstrate that the proposed filtering method is capable of preserving the local invariance of the fractal dimension of the image.
These semantic regions can be used to recognize pre-defined activities in crowd scenes.
Facing the challenges of trajectory clustering, e.g., large variations within a cluster and ambiguities across clusters, we first introduce an adaptive multi-kernel-based estimation process to estimate the `shrunk' positions and speeds of trajectory points.
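The sketch below gives one plausible reading of the kernel-based shrinking step: each trajectory point is averaged with its neighbors under several Gaussian kernels and the results are combined. The uniform combination of kernels stands in for the adaptive weighting, which is not reproduced here, and applies equally to point speeds.

```python
import numpy as np

def shrink_points(points, bandwidths=(1.0, 2.0, 4.0)):
    """Estimate 'shrunk' positions by kernel-weighted averaging of neighbors
    (a mean-shift-like sketch; the adaptive kernel weighting is an assumption).

    points: (N, 2) array of trajectory point coordinates.
    """
    diffs = points[:, None, :] - points[None, :, :]        # (N, N, 2) pairwise offsets
    dist2 = (diffs ** 2).sum(-1)                           # (N, N) squared distances
    shrunk = np.zeros((len(bandwidths), *points.shape))
    for k, h in enumerate(bandwidths):
        w = np.exp(-dist2 / (2.0 * h * h))                 # Gaussian kernel weights
        shrunk[k] = (w[:, :, None] * points[None]).sum(1) / w.sum(1, keepdims=True)
    return shrunk.mean(0)                                  # combine the kernels

pts = np.random.rand(50, 2) * 10
print(shrink_points(pts).shape)   # (50, 2)
```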
The visualization of an image collection is the process of displaying a collection of images on a screen under some specific layout requirements.
This paper addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification.
In this paper, we show that by carefully making good choices for various detailed but important factors in a visual recognition framework using deep-learning features, one can achieve a simple, efficient, yet highly accurate image classification system.
This paper presents a novel approach for automatic recognition of human activities for video surveillance applications.
This paper presents a novel approach for automatic recognition of group activities for video surveillance applications.
We demonstrate that this low-computation-complexity method can efficiently catch the characteristics of the frame.
Image deblurring techniques play important roles in many image processing applications.
In this paper, we propose a new intra-and-inter-constraint-based video enhancement approach aiming to 1) achieve high intra-frame quality of the entire picture where multiple region-of-interests (ROIs) can be adaptively and simultaneously enhanced, and 2) guarantee the inter-frame quality consistencies among video frames.
In this paper, a new heat-map-based (HMB) algorithm is proposed for group activity recognition.
Based on this network, we further model people in the scene as packages while human activities can be modeled as the process of package transmission in the network.