A more realistic object detection paradigm, Open-World Object Detection, has attracted increasing research interest in the community recently.
Inspired by the recent success of the Automated Machine Learning (AutoML) literature, in this paper we present Meta Navigator, a framework that attempts to address the aforementioned limitation in few-shot learning by seeking a higher-level strategy and automating the selection among various few-shot learning designs.
Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing and challenging task in computer vision, aiming to assign pre-defined semantic labels to the pixels of all frames in a given video.
To address this, we propose to mine the contextual information beyond individual images to further augment the pixel representations.
To tackle this practical problem, we propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework.
Recovering occlusion relations among objects from a single image is challenging due to the sparsity of boundaries in images.
Nonlocal-based blocks are designed to capture long-range spatial-temporal dependencies in computer vision tasks.
Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical representations describing visual relationships among salient objects.
Contrastive learning applied to self-supervised representation learning has seen a resurgence in deep models.
In particular, we decouple the training of the representation and the classifier, and systematically investigate the effects of different data re-sampling techniques both when training the whole network (including the classifier) and when fine-tuning only the feature extractor.
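The re-sampling techniques in question can be parameterized in a single function. The sketch below is illustrative, not the paper's code: the power-law exponent `q` is an assumption that interpolates between instance-balanced sampling (`q = 1`, each sample equally likely) and class-balanced sampling (`q = 0`, each class equally likely).

```python
def class_sampling_probs(class_counts, q=1.0):
    """Probability of drawing each class under power-law re-sampling.

    q = 1.0 -> instance-balanced (proportional to class size),
    q = 0.0 -> class-balanced (uniform over classes).
    """
    weights = [c ** q for c in class_counts]
    total = sum(weights)
    return [w / total for w in weights]
```

For a long-tailed dataset with class sizes 100, 10, and 1, `q = 1.0` samples the head class ~90% of the time, while `q = 0.0` draws each class with probability 1/3.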
Leveraging the advances of natural language processing, most recent scene text recognizers adopt an encoder-decoder architecture in which text images are first converted to representative features and then decoded into a sequence of characters via `direct decoding'.
In this paper, we propose a new loss based on center predictivity; that is, a sample must be positioned in the feature space such that the location of the center of same-class samples can be roughly predicted from it.
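The idea can be sketched in a few lines. This is a minimal, assumed formulation rather than the paper's actual loss: embeddings are plain Python float lists, and the hypothetical `predict_center` callable stands in for whatever learned predictor head the method uses.

```python
from collections import defaultdict

def class_centers(embeddings, labels):
    """Mean embedding per class (embeddings are equal-length float lists)."""
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for e, y in zip(embeddings, labels):
        if sums[y] is None:
            sums[y] = [0.0] * len(e)
        sums[y] = [s + v for s, v in zip(sums[y], e)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def center_predictive_loss(embeddings, labels, predict_center):
    """Mean squared distance between the center predicted from each sample
    and the true mean embedding of that sample's class."""
    centers = class_centers(embeddings, labels)
    total = 0.0
    for e, y in zip(embeddings, labels):
        pred = predict_center(e)
        total += sum((p - c) ** 2 for p, c in zip(pred, centers[y]))
    return total / len(embeddings)
```

With the identity as predictor, the loss is zero exactly when every sample already sits at its class center, matching the intuition in the sentence above.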
It is also worth noting that, given identical strong data augmentations, the performance improvement of ConTNet is more remarkable than that of ResNet.
We extract a degradation prior at the task level with the proposed ConditionNet, which is then used to adapt the parameters of the basic SR network (BaseNet).
In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image.
Most of these works achieve this by re-using the backbone network to extract features of selected regions.
Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision.
In this paper, a retrieval-based coarse-to-fine framework is proposed, in which we re-rank the Top-N classification results using local-region-enhanced embedding features to improve Top-1 accuracy, based on the observation that the correct category usually resides in the Top-N results.
Most typical click models assume that the probability that a document is examined by users depends only on its position, as in PBM and UBM.
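The PBM assumption can be stated in a couple of lines (a minimal sketch; the per-position examination probabilities and per-document relevances are assumed inputs, which real click models would estimate from logs):

```python
def pbm_click_prob(exam_probs, relevance):
    """Position-Based Model: P(click at rank k) = P(examine rank k) * P(relevant).

    exam_probs[k] depends only on the position k, never on the document,
    which is exactly the assumption the sentence above describes.
    """
    return [g * r for g, r in zip(exam_probs, relevance)]
```

Even for documents of identical relevance, click probability decays with rank purely through the position-dependent examination term.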
When the Chebyshev graph filter is chosen, a generalized formulation can be derived that explains the existing nonlocal-based blocks (e.g., the nonlocal block, nonlocal stage, and double attention block) and can be used to analyze their irrationality.
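For reference, an order-$K$ Chebyshev graph filter applied to node features can be written as follows. The notation here is the standard graph-signal-processing form, assumed rather than copied from the paper:

```latex
F(\mathbf{A}, \mathbf{X}) = \sum_{k=0}^{K} T_k(\tilde{\mathbf{L}})\, \mathbf{X}\, \mathbf{W}_k,
\qquad
T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x),
```

where $\tilde{\mathbf{L}}$ is a rescaled graph Laplacian built from the affinity matrix $\mathbf{A}$, $\mathbf{X}$ are the input features, and $\mathbf{W}_k$ are learned weights; truncating the sum at low order is what lets the existing nonlocal-style blocks appear as special cases.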
By disentangling representations on both image and instance levels, DIDN is able to learn domain-invariant representations that are suitable for generalized object detection.
A popular attempt to address this challenge is unpaired generative adversarial networks, which generate "real" LR counterparts from real HR images using image-to-image translation and then perform super-resolution from "real" LR to SR.
In this work, we propose TransTrack, a simple yet efficient scheme to solve the multiple object tracking problem.
We identify the classification cost in the matching cost as the main ingredient: (1) previous detectors consider only the location cost; (2) by additionally introducing a classification cost, previous detectors immediately produce one-to-one predictions during inference.
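A matching cost of this kind can be sketched as follows. This is an illustrative, assumed formulation (a weighted sum of an L1 location cost and a classification cost, with brute-force search over assignments standing in for the Hungarian matching detectors use in practice):

```python
from itertools import permutations

def match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_loc=1.0):
    """One-to-one matching minimizing classification + location cost.

    pred_probs[i][c] is prediction i's probability for class c; boxes are
    (cx, cy, w, h) tuples. Brute force is fine for this tiny example.
    Returns best[j] = index of the prediction assigned to ground truth j.
    """
    n = len(gt_labels)

    def cost(i, j):
        cls_cost = 1.0 - pred_probs[i][gt_labels[j]]          # classification cost
        loc_cost = sum(abs(a - b) for a, b in zip(pred_boxes[i], gt_boxes[j]))
        return w_cls * cls_cost + w_loc * loc_cost

    best = min(permutations(range(len(pred_probs)), n),
               key=lambda perm: sum(cost(perm[j], j) for j in range(n)))
    return list(best)
```

Setting `w_cls = 0` reduces this to the location-only matching of point (1); with `w_cls > 0`, confident correct-class predictions win the assignment even when boxes tie.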
In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power.
Specifically, our proposed network consists of three main parts: Siamese Encoder Module, Center Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
In this paper, we study what would happen when normalization layers are removed from the network, and show how to train deep neural networks without normalization layers and without performance degradation.
In our method, however, a fixed sparse set of learned object proposals, $N$ in total, is provided to the object recognition head to perform classification and localization.
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
Semi-supervised video object segmentation is an interesting yet challenging task in machine learning.
To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression.
In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information.
Most previous models try to learn a fixed one-directional mapping between visual and semantic space, while some recently proposed generative methods try to generate image features for unseen classes so that the zero-shot learning problem becomes a traditional fully-supervised classification problem.
In this work, we propose a mask propagation network that treats video segmentation as guided instance segmentation.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions.
Leveraging both visual frames and audio has been experimentally proven effective for improving large-scale video classification.
There has been a drastic growth of research in Generative Adversarial Nets (GANs) in the past few years.
In this way, the sequential representation of an image can be naturally translated to a sequence of words, as the target sequence of the RNN model.
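Such direct decoding can be illustrated with a greedy loop. This is a hypothetical sketch: the `step` interface (one recurrent step returning a new state and per-token scores) and the token set are assumptions, not the actual model.

```python
def greedy_decode(step, init_state, bos, eos, max_len=20):
    """Greedy decoding: feed the previous token back in until EOS.

    `step(state, token) -> (next_state, scores)` performs one RNN step,
    where `scores` maps candidate tokens to (log-)probabilities.
    """
    state, token, out = init_state, bos, []
    for _ in range(max_len):
        state, scores = step(state, token)
        token = max(scores, key=scores.get)   # pick the most likely token
        if token == eos:
            break
        out.append(token)
    return out
```

Swapping the `max` for a top-k expansion would turn this greedy loop into beam search; the interface stays the same.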
Different from existing work that addresses basic morphing types at the layer level, we target the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network.
In this paper, we develop a Single frame Video Parsing (SVP) method which requires only one labeled frame per video in the training stage.
A hierarchical shape parsing strategy is proposed to partition and organize image components into a hierarchical structure in the scale space.