In this paper, we organize the graphs on a super-graph and propose a novel breadth-first-search-based method for expanding the neighborhood on the super-graph for a newly arriving graph, so that matching with the new graph can be performed efficiently within the constructed neighborhood.
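A breadth-first expansion of this kind can be sketched as follows; the adjacency representation, seed choice, and depth bound are illustrative assumptions, not the paper's exact procedure.

```python
from collections import deque

def bfs_neighborhood(adj, seeds, depth):
    """Expand a bounded neighborhood around seed nodes of a super-graph
    via breadth-first search; matching for a newly arriving graph would
    then be restricted to this neighborhood (illustrative sketch)."""
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the depth bound
        for nb in adj.get(node, ()):
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, d + 1))
    return visited
```

Restricting candidate matches to `bfs_neighborhood(adj, seeds, depth)` trades a small risk of missing a match for a much smaller search space than matching against every stored graph.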
Third, by investigating the advantages of both anchor-based and anchor-free models, we further augment AlignPS with an ROI-Align head, which significantly improves the robustness of re-id features while still keeping our model highly efficient.
Ranked #1 on Person Search on CUHK-SYSU
In this paper, we present a simple yet effective continual learning method for BIQA with improved quality prediction accuracy, plasticity-stability trade-off, and task-order/length robustness.
This work demonstrates that it is feasible for blind people to feel the world through the brush in their hands.
On one hand, PointAugmenting decorates point clouds with corresponding point-wise CNN features extracted by pretrained 2D detection models, and then performs 3D object detection over the decorated point clouds.
We propose the first framework to address this novel task, namely Context-Guided Person Search (CGPS), by investigating three levels of context clues (i.e., detection, memory, and scene) in unconstrained natural images.
Based on the semantic priors, we further propose a context-aware image inpainting model, which adaptively integrates global semantics and local features in a unified image generator.
Combinatorial Optimization (CO) has been a long-standing challenging research topic featured by its NP-hard nature.
In this paper, we propose a generic model transfer scheme to make Convolutional Neural Networks (CNNs) interpretable, while maintaining their high classification accuracy.
Extensive experiments on the novel dataset as well as three existing datasets clearly demonstrate the effectiveness of the proposed framework for both group-based re-id tasks.
This paper considers a new problem of adapting a pre-trained model of human mesh reconstruction to out-of-domain streaming videos.
Ranked #6 on 3D Human Pose Estimation on 3DPW
Recently, adversarial attack has been applied to visual object tracking to evaluate the robustness of deep trackers.
For action recognition learning, 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Specifically, based on a shared backbone network, we add a prediction head for a new dataset, and enforce a regularizer to allow all prediction heads to evolve with new data while being resistant to catastrophic forgetting of old data.
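One simple way to instantiate such a regularizer is an L2 anchor that penalizes drift of the shared-backbone parameters from their values before the new head was added; the paper's actual regularizer may differ, so this is only a hedged sketch.

```python
import numpy as np

def regularized_loss(new_task_loss, params, old_params, lam=0.1):
    """Illustrative anti-forgetting regularizer: the new-task loss plus an
    L2 penalty on how far shared parameters have drifted from their
    pre-update snapshot (lam and the L2 form are assumptions)."""
    drift = sum(np.sum((p - q) ** 2) for p, q in zip(params, old_params))
    return new_task_loss + lam * drift
```

With `lam = 0`, the backbone is free to adapt to the new dataset (maximal plasticity); larger `lam` trades plasticity for stability on the old data.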
We propose a deep state space model for probabilistic time series forecasting whereby the non-linear emission model and transition model are parameterized by networks and the dependency is modeled by recurrent neural nets.
Despite their success in perception over the last decade, deep neural networks are also notoriously hungry for labeled training data, which limits their applicability to real-world problems.
However, the impact of the quality of pseudo-labeled samples, as well as mining strategies for high-quality training samples, has rarely been studied in SSL.
In this paper, we investigate the complementary roles of spatial and temporal information and propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatiotemporal information.
This paper considers the setting of jointly matching and clustering multiple graphs belonging to different groups, which naturally arises in many realistic problems.
Ranked #1 on Graph Matching on Willow Object Class
This paper presents a hybrid approach that combines the interpretability of traditional search-based techniques for producing the edit path with the efficiency and adaptivity of deep embedding models, to achieve a cost-effective GED solver.
Single-path based differentiable neural architecture search has great strengths for its low computational cost and memory-friendly nature.
We then propose to recursively alternate the learning schemes of imitation and exploration to narrow the discrepancy between training and inference.
Generating diverse and natural human motion is one of the long-standing goals for creating intelligent characters in the animated world.
In this paper, we focus on exploring the fusion of images and point clouds for 3D object detection in view of the complementary nature of the two modalities, i.e., images possess more semantic information while point clouds specialize in distance sensing.
However, few works study the data augmentation problem for VQA, and none of the existing image-based augmentation schemes (such as rotation and flipping) can be directly applied to VQA due to its semantic structure -- an $\langle image, question, answer\rangle$ triplet needs to be maintained correctly.
The latent code of the recent popular model StyleGAN has learned disentangled representations thanks to the multi-layer style-based generator.
Nevertheless, due to the distributional shift between images simulated in the laboratory and captured in the wild, models trained on databases with synthetic distortions remain particularly weak at handling realistic distortions (and vice versa).
Various point neural networks have been developed with isotropic filters or using weighting matrices to overcome the structure inconsistency on point clouds.
Instance-level denoising on the feature map is performed to enhance the detection of small and cluttered objects.
Mesh is a powerful data structure for 3D shapes.
The saliency annotations of head and eye movements for both original and augmented videos are collected and together constitute the ARVR dataset.
To train a robust decoder against the physical distortion from the real world, a distortion network based on 3D rendering is inserted between the encoder and the decoder to simulate the camera imaging process.
We also show how to extend our network to hypergraph matching, and matching of multiple graphs.
Ranked #1 on Graph Matching on PASCAL VOC
Computational models for blind image quality assessment (BIQA) are typically trained in well-controlled laboratory environments with limited generalizability to realistically distorted images.
In addition to its NP-completeness, another important challenge is the effective modeling of the node-wise and structure-wise affinity across graphs, and of the resulting objective, to guide the matching procedure toward the true matching in the presence of noise.
Ranked #6 on Graph Matching on PASCAL VOC
Most adversarial-learning-based video prediction methods suffer from image blur, since the commonly used adversarial and regression losses work in a competitive rather than collaborative way, yielding blurred results as a compromise.
In this work, we propose a novel hybrid method for scene text detection, namely the Correlation Propagation Network (CPN).
Regression trackers directly learn a mapping from regularly dense samples of target objects to soft labels, which are usually generated by a Gaussian function, to estimate target positions.
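The Gaussian soft labels mentioned above can be generated as follows; the map size and `sigma` are illustrative values, not specific to any one tracker.

```python
import numpy as np

def gaussian_label_map(height, width, center, sigma=2.0):
    """Soft label map that peaks at the target center and decays with a
    Gaussian, as commonly used to supervise regression trackers."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = center
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

labels = gaussian_label_map(31, 31, center=(15, 15))
```

The regressor is then trained so that its dense response map matches `labels`; at test time, the argmax of the predicted map gives the estimated target position.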
Crowd counting or density estimation is a challenging task in computer vision due to large scale variations, perspective distortions and serious occlusions, etc.
Ranked #4 on Crowd Counting on WorldExpo’10
First, to facilitate this novel research on fine-grained video captioning, we collected a new dataset called the Fine-grained Sports Narrative dataset (FSN), which contains 2K sports videos with ground-truth narratives from YouTube.com.
Most human activity analysis works (i.e., recognition or prediction) focus only on a single granularity, i.e., either modeling global motion based on coarse-level movement such as human trajectories, or forecasting future detailed actions based on body parts' movement such as skeleton motion.
Despite the recent emergence of adversarial methods for video prediction, existing algorithms often produce unsatisfactory results in image regions with rich structural information (e.g., object boundaries) and detailed motion (e.g., articulated body movement).
In conventional (multi-dimensional) marked temporal point process models, an event is often encoded by a single discrete variable, i.e., a marker.
Specifically, we learn adaptive correlation filters on the outputs from each convolutional layer to encode the target appearance.
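A minimal single-channel correlation filter, learned in closed form in the Fourier domain via ridge regression (a MOSSE-style simplification; the multi-channel, per-layer filters used with deep features are more involved), can be sketched as:

```python
import numpy as np

def learn_filter(feature, label, lam=1e-2):
    """Closed-form ridge-regression correlation filter in the Fourier
    domain (single-channel illustrative sketch; lam regularizes)."""
    F = np.fft.fft2(feature)
    Y = np.fft.fft2(label)
    return (Y * np.conj(F)) / (F * np.conj(F) + lam)

def apply_filter(W, feature):
    """Dense correlation response; its peak estimates the target position."""
    return np.real(np.fft.ifft2(W * np.fft.fft2(feature)))
```

In practice such filters are learned on features cropped around the current target and updated online, so the encoded appearance model adapts as the target changes.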
Subsequently, existing no-reference IQA algorithms, comprising 5 opinion-aware approaches (NFERM, GMLF, DIIVINE, BRISQUE, and BLIINDS2) and 8 opinion-unaware approaches (QAC, SISBLIM, NIQE, FISBLIM, CPBD, S3, and Fish_bb), were executed to evaluate THz security image quality.
Second, we learn a correlation filter over a feature pyramid centered at the estimated target position for predicting scale changes.
This work makes the first attempt to generate an articulated human motion sequence from a single image.
Ranked #2 on Gesture-to-Gesture Translation on NTU Hand Digit
We introduce a Multiple Granularity Analysis framework for video segmentation in a coarse-to-fine manner.
However, most previous activity recognition methods do not offer a flexible and scalable scheme to handle the high-order context modeling problem.
Towards this end, we propose a novel loopy recurrent neural network (Loopy RNN), which is capable of aggregating relationship information of two input images in a progressive/iterative manner and outputting the consolidated matching score in the final iteration.
The key to automatically generating natural scene images is properly arranging the various spatial elements, especially in the depth direction.
In this paper, we model the background by a Recurrent Neural Network (RNN) with its units aligned with time series indexes while the history effect is modeled by another RNN whose units are aligned with asynchronous events to capture the long-range dynamics.
A variety of real-world processes (over networks) produce sequences of data whose complex temporal dynamics need to be studied.
We address the person re-identification problem by effectively exploiting a globally discriminative feature representation from a sequence of tracked human regions/patches.
Numerous single-image super-resolution algorithms have been proposed in the literature, but few studies address the problem of performance evaluation based on visual perception.
Our analysis and empirical results show that classes with more samples have higher impact on the feature learning.
Fine grained video action analysis often requires reliable detection and tracking of various interacting objects and human body parts, denoted as interactional object parsing.
First, a novel EM-like learning framework is proposed to train the pixel-level deep convolutional neural network (DCNN) by seamlessly integrating weakly supervised data (i.e., massive bounding-box annotations) with a small set of strongly supervised data (i.e., fully annotated hand segmentation maps), achieving state-of-the-art hand segmentation performance.
The outputs of the last convolutional layers encode the semantic information of targets and such representations are robust to significant appearance variations.
To address this problem, we propose a deep convolutional neural network (CNN) for crowd counting, and it is trained alternatively with two related learning objectives, crowd density and crowd count.
Ranked #14 on Crowd Counting on WorldExpo’10
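The two related objectives, crowd density and crowd count, can be written as simple losses to alternate between during training; the exact loss forms below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def density_loss(pred_map, gt_map):
    """Pixel-wise regression of the predicted density map to the
    ground-truth density map."""
    return np.mean((pred_map - gt_map) ** 2)

def count_loss(pred_map, gt_count):
    """The integral (sum) of the density map should match the total
    crowd count in the image."""
    return (pred_map.sum() - gt_count) ** 2
```

Alternating between the two objectives lets the count loss correct global over- or under-estimation that a purely pixel-wise density loss can tolerate.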
Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework to mine discriminative semi-local groups of dense trajectories.
In this paper, we address the problem of long-term visual tracking where the target objects undergo significant appearance variation due to deformation, abrupt motion, heavy occlusion and out-of-the-view.
We propose multi-graph matching methods that incorporate both aspects by boosting the affinity score while gradually infusing consistency as a regularizer.
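The blend of affinity and consistency can be sketched as a graduated convex combination whose weight on consistency grows over iterations; this scalar form is an illustrative simplification of the matrix-valued objective actually optimized.

```python
def regularized_affinity(affinity, consistency, lam):
    """Graduated blend: raw affinity score softened by a consistency
    term, with lam typically increased from 0 toward 1 over iterations
    (illustrative scalar sketch)."""
    return (1.0 - lam) * affinity + lam * consistency
```

Starting with `lam` near 0 lets early iterations follow the raw affinity, while later iterations increasingly favor matchings that are cycle-consistent across the whole graph collection.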