More notably, our SDMP is the first method that successfully leverages data mixing to improve (rather than hurt) the performance of Vision Transformers in the self-supervised setting.
To enable training under this new setting, we convert the crowd count regression problem to a ranking potential prediction problem.
Although previous methods can leverage generative priors to produce high-resolution results, their quality can suffer from the entangled semantics of the latent space.
This paper presents a Generative prior ReciprocAted Invertible rescaling Network (GRAIN) for generating faithful high-resolution (HR) images from low-resolution (LR) invertible images with an extreme upscaling factor (64x).
This novel merging scheme enables the self-attention to learn relationships between objects of different sizes while simultaneously reducing the number of tokens and the computational cost.
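As a rough illustration of why merging helps, the toy sketch below (in PyTorch; the function name and the 2x2 average-pooling rule are assumptions for illustration, not the paper's actual merging scheme) quarters the token count, which directly shrinks the quadratic cost of self-attention:

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens, h, w):
    """Toy token-merging step (hypothetical, not the paper's scheme):
    average-pool each 2x2 neighborhood of a (B, h*w, C) token grid into
    one token, quartering the sequence self-attention must process."""
    b, n, c = tokens.shape
    assert n == h * w and h % 2 == 0 and w % 2 == 0
    grid = tokens.reshape(b, h, w, c).permute(0, 3, 1, 2)  # (B, C, h, w)
    merged = F.avg_pool2d(grid, kernel_size=2)             # (B, C, h/2, w/2)
    return merged.flatten(2).transpose(1, 2)               # (B, h*w/4, C)
```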
Ultra-high-resolution image segmentation has attracted increasing interest in recent years due to its real-world applications.
We further leverage the derived segments to propose a crowd-aware fine-grained domain adaptation framework for crowd counting, which consists of two novel adaptation modules, i.e., Crowd Region Transfer (CRT) and Crowd Density Alignment (CDA).
In this way, we can transfer the spatial labeling redundancy caused by individual similarity into effective supervision signals for the unlabeled regions.
In this paper, we introduce a new attention-based encoder, the Vision Transformer, into salient object detection to keep the learned representations global from shallow to deep layers.
This inborn property is used for two unique purposes: 1) regularizing the joint inversion process, such that each inverted code is semantically accessible from the others and anchored in an editable domain; 2) enforcing inter-image coherence, such that the fidelity of each inverted code can be maximized with the complement of the other images.
Transformers have recently been adapted from the natural language processing community as a promising substitute for convolution-based neural networks in visual learning tasks.
Additionally, to exclude information about moving background objects from the motion features, our transformation module reciprocally transforms the appearance features to enhance the motion features, focusing on moving objects with salient appearance while removing co-moving outliers.
The key is to model the relationship between the query videos and the support images for propagating the object information.
To these ends, we introduce a trainable "master" network which ingests both audio signals and silent lip videos instead of a pretrained teacher.
Furthermore, our model runs at 35 FPS on a single GPU, making it efficient and applicable to real-time panoramic HD map reconstruction.
In this way, arbitrary attributes can be edited by collecting positive data only, and the proposed method learns a controllable representation enabling manipulation of non-binary attributes like anime styles and facial characteristics.
Shadow removal is an important yet challenging task in image processing and computer vision.
Image matting is an ill-posed problem that usually requires additional user input, such as trimaps or scribbles.
Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc.
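For concreteness, here is a minimal sketch of such summaries in pure NumPy; the function names and the specific statistics are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def motion_statistics(clip):
    """Locate the largest frame-to-frame motion and a crude proxy for its
    dominant direction; `clip` is a (T, H, W, 3) array. Illustrative only."""
    gray = clip.astype(np.float32).mean(axis=-1)        # (T, H, W)
    energy = np.abs(np.diff(gray, axis=0)).sum(axis=0)  # per-pixel motion energy
    y, x = np.unravel_index(energy.argmax(), energy.shape)
    gy, gx = np.gradient(gray.mean(axis=0))             # spatial gradients
    direction = np.arctan2(gy[y, x], gx[y, x])
    return (y, x), direction

def color_diversity_statistics(clip):
    """Locate the pixel with the largest color variance along time and
    report its temporally dominant (mean) color."""
    var = clip.astype(np.float32).var(axis=0).sum(axis=-1)   # (H, W)
    y, x = np.unravel_index(var.argmax(), var.shape)
    dominant_color = clip[:, y, x].astype(np.float32).mean(axis=0)
    return (y, x), dominant_color
```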
In this paper, we propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD) from three aspects: spatial, temporal, and online excitations.
In this paper, we formulate a novel crowd analysis problem, in which we aim to predict the crowd distribution in the near future given sequential frames of a crowd video without any identity annotations.
It avoids the heavy computation of exhaustively searching all the cycle lengths in the video, and, instead, it propagates the coarse prediction for further refinement in a hierarchical manner.
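A minimal sketch of this coarse-to-fine idea, assuming a 1-D feature sequence, an autocorrelation score, and a signal long enough for the coarse pass (all of which are illustrative assumptions rather than the paper's method), could look like:

```python
import numpy as np

def estimate_cycle_length(signal, min_len=2, levels=3):
    """Estimate a repetition cycle length on a downsampled signal, then
    refine it at finer resolutions instead of scanning every candidate."""
    def best_length(x, candidates):
        # Score each candidate lag by (normalized) autocorrelation.
        scores = [np.dot(x[:-c], x[c:]) / max(len(x) - c, 1) for c in candidates]
        return candidates[int(np.argmax(scores))]

    # Coarse pass on a strided (downsampled) copy of the signal.
    stride = 2 ** (levels - 1)
    coarse = signal[::stride]
    length = best_length(coarse, range(min_len, len(coarse) // 2)) * stride

    # Refine: at each finer level, search only a small window around the
    # propagated estimate rather than all possible cycle lengths.
    for level in range(levels - 2, -1, -1):
        s = 2 ** level
        x = signal[::s]
        center = length // s
        window = range(max(min_len, center - 2), center + 3)
        length = best_length(x, window) * s
    return length
```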
This can be naturally generalized to span multiple scales with a Laplacian pyramid representation of the input data.
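A minimal sketch of such a Laplacian pyramid (NumPy/SciPy; the blur strength, depth, and nearest-neighbor upsampling are arbitrary illustrative choices) is:

```python
import numpy as np
from scipy import ndimage

def laplacian_pyramid(img, levels=4):
    """Decompose a 2-D array into band-pass detail levels between
    successive blurred, downsampled copies, plus a low-pass residual."""
    pyramid = []
    current = img.astype(np.float32)
    for _ in range(levels):
        blurred = ndimage.gaussian_filter(current, sigma=1.0)
        down = blurred[::2, ::2]
        # Nearest-neighbor upsample back to the current resolution.
        up = np.kron(down, np.ones((2, 2)))[:current.shape[0], :current.shape[1]]
        pyramid.append(current - up)   # band-pass detail at this scale
        current = down
    pyramid.append(current)            # low-pass residual
    return pyramid
```

Reconstruction reverses the loop: upsample the residual and add back each detail level in turn.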
We conduct comparative experiments on this dataset and demonstrate that our model outperforms the state of the art in recovering the segmentation masks and appearance of occluded vehicles.
We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach.
We first propose a facial-component-guided deep convolutional neural network (CNN) to restore a coarse face image, denoted the base image, in which the facial components are automatically generated from the input face image.
In addition, we propose a gated fusion scheme to control how the variations captured by the deformable convolution affect the original appearance.
Based on these findings, we present a scale-insensitive convolutional neural network (SINet) for fast detection of vehicles with a large variance in scale.
Egocentric videos, which mainly record the activities carried out by the users of wearable cameras, have drawn much research attention in recent years.
Experiments show that the proposed multi-task network outperforms existing multi-task architectures, and the auxiliary subitizing network provides strong guidance to salient object detection by reducing false positives and producing coherent saliency maps.
Two levels of features are derived from the global network and transferred to two parallel networks.
Numerous efforts have been made to design low-level saliency cues for RGB-D saliency detection, such as color or depth contrast features and background or color compactness priors.
In this paper, we present a real-time salient object detection system based on the minimum spanning tree.
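The broad idea can be sketched as follows (SciPy; note that the paper computes a barrier distance on the tree, whereas this illustration substitutes a plain shortest-path distance from boundary seeds, and all names are assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, dijkstra

def mst_boundary_distance(gray):
    """Connect 4-neighboring pixels with weights equal to their intensity
    difference, extract a minimum spanning tree, and measure each pixel's
    tree distance to the image boundary (background seeds). Pixels far
    from the boundary on the tree are treated as salient."""
    h, w = gray.shape
    idx = np.arange(h * w).reshape(h, w)
    flat = gray.ravel().astype(np.float64)
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    # Small epsilon keeps zero-difference edges from vanishing in the
    # sparse representation.
    weights = np.abs(flat[rows] - flat[cols]) + 1e-6
    graph = csr_matrix((weights, (rows, cols)), shape=(h * w, h * w))
    mst = minimum_spanning_tree(graph)
    mst = mst + mst.T                   # make the tree undirected
    seeds = np.unique(np.concatenate([idx[0], idx[-1], idx[:, 0], idx[:, -1]]))
    dist = dijkstra(mst, indices=seeds).min(axis=0)
    return dist.reshape(h, w)           # higher = farther from background
```

Because the tree has only h*w - 1 edges, distance propagation over it is what makes a real-time implementation plausible.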
A robust tracking framework based on locality sensitive histograms is proposed, which consists of two main components: a new feature for tracking that is robust to illumination changes and a novel multi-region tracking algorithm that runs in real time even with hundreds of regions.
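A 1-D sketch of the locality sensitive histogram recursion (NumPy; the bin count and decay factor are illustrative, and intensities are assumed normalized to [0, 1]) is:

```python
import numpy as np

def locality_sensitive_histograms(intensities, n_bins=8, alpha=0.9):
    """Each pixel gets a histogram over the whole signal in which every
    other pixel's vote decays exponentially (factor `alpha`) with its
    distance, computed with two linear-time recursive passes."""
    n = len(intensities)
    bins = np.minimum((intensities * n_bins).astype(int), n_bins - 1)
    one_hot = np.zeros((n, n_bins))
    one_hot[np.arange(n), bins] = 1.0

    left = np.zeros_like(one_hot)   # decayed contributions from pixels <= p
    right = np.zeros_like(one_hot)  # decayed contributions from pixels >= p
    left[0] = one_hot[0]
    for p in range(1, n):
        left[p] = one_hot[p] + alpha * left[p - 1]
    right[-1] = one_hot[-1]
    for p in range(n - 2, -1, -1):
        right[p] = one_hot[p] + alpha * right[p + 1]
    return left + right - one_hot   # pixel p itself is counted only once
```

The two sweeps replace an O(n * window) sliding-window histogram with O(n) work per bin, which is what makes the hundreds-of-regions real-time claim plausible.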