State-of-the-art object detectors exploit multi-branch structures and predict objects at several different scales. Although this substantially boosts accuracy, it inevitably hurts efficiency, because fragmented structures are hardware-unfriendly.
Generally, existing approaches to out-of-distribution (OOD) detection mainly focus on the statistical difference between the features of OOD and in-distribution (ID) data extracted by the classifiers.
Inspired by the fact that modeling task relationships via multi-task learning can usually boost performance, we propose SoftCPT (Soft Context Sharing for Prompt Tuning), a novel method that fine-tunes pre-trained vision-language models on multiple target few-shot tasks simultaneously.
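A minimal sketch of the shared-context idea, assuming a CoOp-style learned prompt (all names, shapes, and hyper-parameters below are illustrative, not the authors' released code): one context shared across all tasks plus a small per-task offset, so the few-shot tasks regularize one another.

```python
import torch
import torch.nn as nn

class SharedContextPrompts(nn.Module):
    """Sketch: task prompt = shared context + task-specific delta.
    `n_tasks`, `n_ctx` (prompt length), and `dim` are illustrative."""
    def __init__(self, n_tasks: int, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.shared_ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.task_ctx = nn.Parameter(torch.randn(n_tasks, n_ctx, dim) * 0.02)

    def forward(self, task_id: int) -> torch.Tensor:
        # Prompt for one task = shared context + per-task offset.
        return self.shared_ctx + self.task_ctx[task_id]

prompts = SharedContextPrompts(n_tasks=5)
print(prompts(2).shape)  # torch.Size([4, 512]); prepended to class-name tokens
```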
Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios.
Ranked #3 on Zero-Shot Action Recognition on Kinetics
To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
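One way to picture parameter-efficient prompt tuning of a frozen vision model (a sketch under assumed shapes; Pro-tuning's actual prompt blocks are more elaborate than this): the backbone stays frozen, and only a small prompt vector plus the task head receive gradients.

```python
import torch
import torch.nn as nn
import torchvision

# Freeze the backbone; only the prompt and the head are trainable.
backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Identity()
for p in backbone.parameters():
    p.requires_grad = False

class PromptHead(nn.Module):
    def __init__(self, feat_dim: int = 2048, n_classes: int = 10):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(feat_dim))  # learnable prompt
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats + self.prompt)  # inject prompt, then classify

head = PromptHead()
logits = head(backbone(torch.randn(2, 3, 224, 224)))
print(logits.shape)  # torch.Size([2, 10])
```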
PoER helps the neural network first capture label-related features that contain domain information in the shallow layers, and then progressively distill out the label-discriminative representations, forcing the network to be aware of the characteristics of objects versus background, which is vital for generating domain-invariant features.
Training Deep Neural Networks (DNNs) is inherently subject to sensitive hyper-parameters and untimely feedback from performance evaluation.
We further show that such a process is equivalent to adding an adversarial perturbation to the model input, and we therefore name our proposed approach Adversarial Gradient Driven Exploration (AGE).
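A hedged sketch of the gradient-driven perturbation step (the scalar exploration objective, step size, and sign-based update are assumptions in the spirit of FGSM, not necessarily AGE's exact formulation):

```python
import torch

def gradient_driven_perturbation(model, x, eps=1e-2):
    """Sketch: perturb the input along the gradient of a scalar objective.
    `eps` and the objective are illustrative placeholders."""
    x = x.detach().clone().requires_grad_(True)
    model(x).sum().backward()            # assumed scalar exploration objective
    return x.detach() + eps * x.grad.sign()
```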
We compare the proposed method with Unrolled GAN (Metz et al. 2016), BourGAN (Xiao, Zhong, and Zheng 2018), PacGAN (Lin et al. 2018), VEEGAN (Srivastava et al. 2017), and ALI (Dumoulin et al. 2016) on 2D synthetic datasets, and the results show that the diversity penalty module can help GAN capture many more modes of the data distribution.
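A minimal sketch of a diversity penalty of this kind (the exact form used in the paper may differ; the ratio below follows the common mode-seeking formulation): outputs are pushed apart in proportion to how far apart their latent codes are.

```python
import torch

def diversity_penalty(generator, z_dim=64, batch=8):
    """Sketch of a mode-seeking diversity penalty (illustrative form)."""
    z1, z2 = torch.randn(batch, z_dim), torch.randn(batch, z_dim)
    x1, x2 = generator(z1), generator(z2)
    num = (x1 - x2).flatten(1).norm(dim=1)   # output distance
    den = (z1 - z2).norm(dim=1) + 1e-8       # latent distance
    return -(num / den).mean()               # added to the generator loss
```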
Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo-image or a spatial-temporal graph and apply a deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction.
This module is used to squeeze the object boundary from both inner and outer directions, which contributes to precise mask representation.
Specifically, KTNet is constructed on a base detector with intrinsic knowledge mining and relational knowledge constraints.
For example, a simple attempt to learn the combination of feature A and feature B, <A, B>, as an explicit Cartesian-product representation of a new feature can outperform previous implicit feature-interaction models, including factorization machine (FM)-based models and their variants.
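A small sketch of the explicit Cartesian-product feature (the hash function, bucket count, and embedding size are illustrative): the pair <A, B> gets its own id and its own embedding, instead of an implicitly modeled interaction.

```python
import torch
import torch.nn as nn

N_BUCKETS = 10_000  # illustrative hash space for the cross feature

class CartesianCross(nn.Module):
    """Sketch: treat <A, B> as one new feature with a dedicated embedding."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.emb = nn.Embedding(N_BUCKETS, dim)

    def forward(self, a_ids: torch.Tensor, b_ids: torch.Tensor):
        cross_id = (a_ids * 1_000_003 + b_ids) % N_BUCKETS  # simple hash
        return self.emb(cross_id)

cross = CartesianCross()
print(cross(torch.tensor([3, 7]), torch.tensor([11, 5])).shape)  # [2, 16]
```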
To address these issues, we propose a novel framework named Structure Learning Convolution (SLC), which extends the traditional convolutional neural network (CNN) to graph domains and learns the graph structure for traffic forecasting (a minimal sketch follows the ranking entry below).
Ranked #2 on Traffic Prediction on METR-LA
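As referenced above, a minimal sketch of learning graph structure inside a convolution-like layer (a generic learned-adjacency layer under assumed shapes, not the paper's exact SLC operator; 207 is the METR-LA sensor count):

```python
import torch
import torch.nn as nn

class LearnedGraphConv(nn.Module):
    """Sketch: aggregate over a learned, softmax-normalized adjacency,
    then apply a linear transform (illustrative shapes)."""
    def __init__(self, n_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        self.adj_logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_nodes, in_dim)
        adj = self.adj_logits.softmax(dim=-1)  # learned graph structure
        return self.lin(adj @ x)               # aggregate, then transform

layer = LearnedGraphConv(n_nodes=207, in_dim=2, out_dim=32)
print(layer(torch.randn(4, 207, 2)).shape)  # torch.Size([4, 207, 32])
```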
In this paper, we first analyze the design defects of the feature pyramid in FPN, and then introduce a new feature pyramid architecture named AugFPN to address these problems.
Neural architecture search (NAS) is inherently subject to a gap between the architectures used during searching and those used during validation.
Transferring existing image-based detectors to video is non-trivial, since the quality of frames is frequently degraded by partial occlusion, rare poses, and motion blur.
Ranked #14 on Video Object Detection on ImageNet VID
In our model, we decouple character images into style representation and content representation, which facilitates more precise control of these two types of variables, thereby improving the quality of the generated results.
Point cloud processing is very challenging, as the diverse shapes formed by irregular points are often indistinguishable.
Ranked #16 on 3D Part Segmentation on ShapeNet-Part
For network architecture search (NAS), it is crucial but challenging to simultaneously guarantee both effectiveness and efficiency.
Traditional clustering methods often perform clustering on low-level, indiscriminative representations and ignore the relationships between patterns, resulting in limited gains in the era of deep learning.
Under the new schema, the proposed method can achieve superior accuracy (WIDER FACE Val/Test -- Easy: 0.910/0.896, Medium: 0.881/0.865, Hard: 0.780/0.770; FDDB -- discontinuous: 0.973, continuous: 0.724).
Ranked #6 on Face Detection on FDDB
Specifically, the convolutional weights for a local point set are forced to learn a high-level relation expression, derived from predefined geometric priors, between a point sampled from this point set and the other points (a minimal sketch follows the ranking entry below).
Ranked #9 on 3D Point Cloud Classification on ModelNet40-C
Instead of relying on optical flow, this paper proposes a novel module called Progressive Sparse Local Attention (PSLA), which establishes the spatial correspondence between features across frames in a local region with progressively sparser strides, and uses the correspondence to propagate features.
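A rough sketch of a progressively sparser local attention pattern (the ring/stride schedule below is an assumption; PSLA's actual schedule may differ): offsets are dense near the center and strided further out.

```python
def psla_offsets(max_radius=4):
    """Sketch: dense offsets near the center, strided offsets on outer
    rings; correspondence is computed only at these offsets."""
    offsets = []
    for dy in range(-max_radius, max_radius + 1):
        for dx in range(-max_radius, max_radius + 1):
            ring = max(abs(dy), abs(dx))
            stride = 1 if ring <= 1 else 2  # sparser sampling further out
            if dy % stride == 0 and dx % stride == 0:
                offsets.append((dy, dx))
    return offsets

print(len(psla_offsets()))  # far fewer positions than a dense 9x9 window
```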
Convolutional neural networks (CNNs) are inherently subject to fixed filters that can only aggregate local inputs with the same topological structure.
Here our goal is to automatically find a compact neural network model with high performance that is suitable for mobile devices.
Specifically, given the image-level annotations, (1) we first develop a weakly-supervised detection (WSD) model, and then (2) construct an end-to-end multi-label image classification framework augmented by a knowledge distillation module, in which the WSD model guides the classification model through its class-level predictions for the whole image and its object-level visual features for object RoIs (a minimal sketch follows the ranking entry below).
Ranked #9 on Multi-Label Classification on NUS-WIDE
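As referenced above, a minimal sketch of the class-level distillation term (multi-label form with sigmoid outputs, fitting NUS-WIDE; the temperature and the exact loss are assumptions):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Sketch: the classifier (student) is guided toward the WSD model's
    (teacher's) image-level class predictions. `T` is illustrative."""
    s = torch.sigmoid(student_logits / T)
    t = torch.sigmoid(teacher_logits / T).detach()  # teacher is not updated
    return F.binary_cross_entropy(s, t)
```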
This paper proposes a segment-free method for geometric rectification of a distorted document image captured by a hand-held camera.
To address this issue, we propose Reinforced Evolutionary Neural Architecture Search (RENAS), an evolutionary method with reinforced mutation for NAS.
Specifically, for confusing man-made objects, ScasNet improves labeling coherence with sequential global-to-local context aggregation.
To obtain a comprehensive evaluation, we choose to include both float type features and binary ones.
(2) A multi-integer-embedding is employed for compressing the whole database, which is modeled by binary sparse representation with fixed sparsity.
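A tiny sketch of the multi-integer-embedding idea (dictionary size and sparsity level are illustrative): a binary sparse code with fixed sparsity is stored as the integer indices of its active atoms, which is far smaller than the full bit vector.

```python
import numpy as np

def to_multi_integer(binary_code: np.ndarray, k: int) -> np.ndarray:
    """Sketch: compress a k-sparse binary code to its k atom indices."""
    idx = np.flatnonzero(binary_code)
    assert idx.size == k, "fixed sparsity assumed"
    return idx.astype(np.uint32)

code = np.zeros(1024, dtype=np.uint8)
code[[3, 77, 512]] = 1
print(to_multi_integer(code, k=3))  # [  3  77 512] -> 3 ints vs. 1024 bits
```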
Binary features have become increasingly popular in the past few years due to their low memory footprint and the efficiency of computing the Hamming distance between binary descriptors.
In our SGF, we use the tree distance on the segment graph to define the internal weight function of the filtering kernel, which enables the filter to smooth out high-contrast details and textures while preserving major image structures very well.
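A minimal sketch of such a kernel (the exponential form and bandwidth are assumptions; the point is that the weight decays with tree distance on the segment graph rather than with Euclidean or color distance, so smoothing stays within structures):

```python
import numpy as np

def sgf_weight(tree_dist: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Sketch: filter weight decays with tree distance on the segment
    graph; `sigma` controls how sharply structure boundaries cut it off."""
    return np.exp(-tree_dist / sigma)

d = np.array([0.0, 0.05, 0.5])  # same segment, nearby, across a boundary
print(sgf_weight(d))            # [1.0, ~0.61, ~0.007]
```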
In this paper, we propose an efficient method for accurate extraction of these virtual visual cues from a curved document image.
Finally, to overcome the ineffectiveness of current methods at road intersections, a fitting-based road-centerline connection algorithm is proposed.
A new approach to the problem has been proposed, which aims to match features of different modalities directly.
Subset selection from massive data with noisy information is increasingly popular for various applications.
With this constraint, our method can learn a compact space, where highly similar pixels are grouped to share correlated sparse representations.
Hyperspectral unmixing, the process of estimating a common set of spectral bases and their corresponding composite percentages at each pixel, is an important task for hyperspectral analysis, visualization and understanding.
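A worked toy example of the linear mixing model behind unmixing (a crude per-pixel baseline, not a full unmixing algorithm): each pixel x is modeled as x ~= E @ a, where the columns of E are endmember spectra and the abundances a are non-negative and sum to one.

```python
import numpy as np
from scipy.optimize import nnls

def unmix_pixel(E: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Sketch: non-negative least squares per pixel, then renormalize
    so the abundances sum to one (simplified baseline)."""
    a, _ = nnls(E, x)
    return a / max(a.sum(), 1e-12)

E = np.random.rand(50, 3)            # 50 bands, 3 endmembers (toy data)
x = E @ np.array([0.2, 0.5, 0.3])    # pixel mixed with known abundances
print(unmix_pixel(E, x).round(2))    # ~[0.2, 0.5, 0.3]
```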