Self-supervised learning (SSL) methods targeting scene images have seen a rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module.
QFD first trains a quantized (or binarized) representation as the teacher, then quantizes the network using knowledge distillation (KD).
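The two-stage recipe can be sketched with toy helpers (pure Python, illustrative only; `quantize`, `kd_loss`, the uniform symmetric scheme, the temperature, and the bit-width are assumptions for exposition, not QFD's actual formulation):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-label distillation loss: KL(teacher || student) at temperature T,
    scaled by T^2 as is conventional in KD."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

def quantize(weights, bits=8):
    """Uniform symmetric quantization of a weight list to `bits` bits."""
    max_abs = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]
```

In a training loop, one would first obtain the quantized teacher, then minimize `kd_loss` between the (quantized) student's logits and the teacher's logits.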
In this paper, we propose a method called synergistic self-supervised and quantization learning (SSQL) to pretrain quantization-friendly self-supervised models that facilitate downstream deployment.
Few-shot recognition learns a recognition model with very few (e.g., 1 or 5) images per category, and current few-shot learning methods focus on improving the average accuracy over many episodes.
However, they lack theoretical support and cannot explain why predictions are good candidates for pseudo-labels in the deep learning paradigm.
It is easy to collect a dataset with noisy labels, but such noise makes networks overfit severely and causes accuracy to drop dramatically.
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding.
Ranked #3 on Temporal Action Localization on EPIC-KITCHENS-100
Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher model, which is cumbersome.
Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications.
Learning from the web can ease the extreme dependence of deep learning on large-scale manually labeled datasets.
Multi-label image recognition is a challenging computer vision task of practical use.
Ranked #1 on Multi-Label Image Classification on VOC2007
Modern deep learning models require large amounts of accurately annotated data, which is often difficult to satisfy.
That is, a CNN has an inductive bias to naturally focus on objects, which we name Tobias ("The object is at sight") in this paper.
However, the lack of bounding-box supervision makes its accuracy much lower than fully supervised object detection (FSOD), and currently modern FSOD techniques cannot be applied to WSOD.
In recent years, visual recognition on challenging long-tailed distributions, where classes often exhibit extremely imbalanced frequencies, has made great progress mostly based on various complex paradigms (e.g., meta-learning).
We tackle the long-tailed visual recognition problem from the knowledge distillation perspective by proposing a Distill the Virtual Examples (DiVE) method.
Ranked #17 on Long-tail Learning on iNaturalist 2018
In this paper, we find that mixup constantly explores the representation space. Inspired by the exploration-exploitation dilemma in reinforcement learning, we propose mixup Without hesitation (mWh), a concise, effective, and easy-to-use training algorithm.
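The exploration-then-exploitation idea can be sketched as follows (pure Python; `mixup` is the standard formulation, while `use_mixup` with its `explore_frac` cutoff is a simplified illustration, not mWh's actual schedule):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Standard mixup: a convex combination of two inputs and their
    one-hot labels, with the mixing weight drawn from Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

def use_mixup(epoch, total_epochs, explore_frac=0.8):
    # Simplified mWh-flavoured schedule: mix during the early "exploration"
    # epochs, then train on clean examples in the final "exploitation" phase.
    return epoch < int(total_epochs * explore_frac)
```

Because the outputs are convex combinations, mixed inputs and labels always stay within the range spanned by the two originals.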
We argue that the teacher should give more freedom to the student feature's magnitude, and let the student pay more attention to mimicking the feature direction.
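A direction-only objective of this kind can be sketched as one minus the cosine similarity of the two feature vectors, which is invariant to the student feature's magnitude (pure-Python sketch; the function name and `eps` are illustrative, not the paper's exact loss):

```python
import math

def direction_loss(student_feat, teacher_feat, eps=1e-8):
    """Penalize only the angle between student and teacher features,
    ignoring magnitude: 1 - cosine similarity."""
    dot = sum(s * t for s, t in zip(student_feat, teacher_feat))
    ns = math.sqrt(sum(s * s for s in student_feat)) + eps
    nt = math.sqrt(sum(t * t for t in teacher_feat)) + eps
    return 1.0 - dot / (ns * nt)
```

Note that scaling the student feature by any positive constant leaves this loss essentially unchanged, which is exactly the freedom in magnitude being argued for.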
Ranked #1 on Knowledge Distillation on COCO (mAP metric)
The effectiveness of our approach has been demonstrated on both facial age and attractiveness estimation tasks.
Ranked #1 on Age Estimation on ChaLearn 2016
Weakly supervised object localization (WSOL) aims to localize objects with only image-level labels.
Ranked #2 on Weakly-Supervised Object Localization on CUB-200-2011 (Top-1 Localization Accuracy metric)
In this paper, we propose a principled end-to-end framework named deep decipher (D2) for SSL.
Among the various research areas of CV, fine-grained image analysis (FGIA) is a longstanding and fundamental problem, and has become ubiquitous in diverse real-world applications.
We show that pre-trained weights on ImageNet improve the accuracy under the real-time action recognition setting.
Deep learning has achieved excellent performance in various computer vision tasks, but requires a lot of training examples with clean labels.
Ranked #23 on Image Classification on Clothing1M (using extra training data)
While practitioners have had an intuitive understanding of these observations, we conduct a comprehensive empirical analysis and demonstrate that: (1) the gains from SSL techniques over a fully-supervised baseline are smaller when trained from a pre-trained model than when trained from random initialization, (2) when the domain of the source data used to train the pre-trained model differs significantly from the domain of the target task, the gains from SSL are significantly higher, and (3) some SSL methods (e.g., Pseudo-Label) are able to improve fully-supervised baselines.
Inspired by the coarse-to-fine hierarchical process, we propose an end-to-end RNN-based Hierarchical Attention (RNN-HA) classification model for vehicle re-identification.
To solve this problem, we propose an end-to-end trainable deep network which is inspired by the state-of-the-art fine-grained recognition model and is tailored for the FSFG task.
To be specific, our approach outperforms the previous state-of-the-art model named DeepLab v3 by 1.5% on the PASCAL VOC 2012 val set and 0.6% on the test set by replacing the Atrous Spatial Pyramid Pooling (ASPP) module in DeepLab v3 with the proposed Vortex Pooling.
Although binary visual representations have traditionally been designed mainly to reduce computational and storage costs in image retrieval research, this paper argues that they can also be applied to large-scale recognition and detection problems, in addition to hashing for retrieval.
In this paper, we propose a novel deep-based framework for action recognition, which improves the recognition accuracy by: 1) deriving more precise features for representing actions, and 2) reducing the asynchrony between different information streams.
Appropriate comments for code snippets provide insight into code functionality and are helpful for program comprehension.
Reusable model design becomes desirable with the rapid expansion of computer vision and machine learning applications.
Ranked #11 on Single-object discovery on COCO_20k
In this paper, we propose Adaptive Feeding (AF) to combine a fast (but less accurate) detector and an accurate (but slow) detector, by adaptively determining whether an image is easy or hard and choosing an appropriate detector for it.
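The routing idea can be sketched as a dispatcher (illustrative only; in Adaptive Feeding the easy/hard decision is itself learned, whereas here `is_easy` is a caller-supplied predicate and the detector interfaces are hypothetical):

```python
def adaptive_feeding(images, is_easy, fast_detector, slow_detector):
    """Route each image to the fast detector when predicted 'easy',
    and to the accurate-but-slow detector otherwise."""
    return [fast_detector(img) if is_easy(img) else slow_detector(img)
            for img in images]
```

The average inference cost then interpolates between the two detectors, weighted by the fraction of images predicted easy.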
Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of a roughly 1% top-5 accuracy drop.
The difficulty of image recognition has gradually increased from general category recognition to fine-grained recognition and to the recognition of some subtle attributes such as temperature and geolocation.
We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair.
However, it is difficult to collect sufficient training images with precise labels in some domains such as apparent age estimation, head pose estimation, multi-label classification and semantic segmentation.
Ranked #1 on Head Pose Estimation on BJUT-3D
Our approach first leverages the complete information from given trajectories to construct a thermal transfer field which provides a context-rich way to describe the global motion pattern in a scene.
Large receptive field and dense prediction are both important for achieving high accuracy in pixel labeling tasks such as semantic segmentation.
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations.
Moreover, on general image retrieval datasets, SCDA achieves comparable retrieval results with state-of-the-art general image retrieval approaches.
Our column generation based method can be further generalized from the triplet loss to a general structured learning based framework that allows one to directly optimize multivariate performance measures.
These semantic regions can be used to recognize pre-defined activities in crowd scenes.
This paper addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification.
With strong labels, our framework is able to achieve state-of-the-art results in both datasets.
Ranked #16 on Multi-Label Classification on PASCAL VOC 2007
In this paper, we show that by making careful choices for various detailed but important factors in a visual recognition framework using deep learning features, one can achieve a simple, efficient, yet highly accurate image classification system.
Most existing works heavily rely on object/part detectors to build the correspondence between object parts by using object or object part annotations inside training images.
In computer vision, an entity such as an image or video is often represented as a set of instance vectors, which can be SIFT, motion, or deep learning feature vectors extracted from different parts of that entity.
In this paper, a new heat-map-based (HMB) algorithm is proposed for group activity recognition.
Based on this network, we further model people in the scene as packages while human activities can be modeled as the process of package transmission in the network.
In spite of the popularity of various feature compression methods, this paper argues that feature selection is a better choice than feature compression.