CVPR 2018

The most popular implementations from this conference
1
Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. We combine a novel 3D-based loss with 2D losses based on the photometric quality of frame reconstructions, using estimated depth and ego-motion from adjacent frames.
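As a rough illustration of the 2D photometric term, the sketch below inverse-warps a source frame into the target view using predicted depth and relative pose, then penalizes the L1 reconstruction error. This is a minimal PyTorch sketch, not the authors' code: the function name, the (B, C, H, W) tensor layout, and the assumption of known intrinsics are all illustrative.

    import torch
    import torch.nn.functional as F

    def photometric_loss(target, source, depth, K, T):
        # target, source: (B,3,H,W) frames; depth: (B,1,H,W) for the target
        # frame; K: (B,3,3) intrinsics; T: (B,4,4) target-to-source pose.
        B, _, H, W = target.shape
        dev = target.device
        ys, xs = torch.meshgrid(torch.arange(H, device=dev),
                                torch.arange(W, device=dev), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
        pix = pix.view(1, 3, -1).expand(B, 3, H * W)
        # back-project pixels to 3D, move them into the source camera frame
        cam = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)
        cam = torch.cat([cam, torch.ones(B, 1, H * W, device=dev)], 1)
        src = (T @ cam)[:, :3]
        # project into the source image and bilinearly sample it there
        uv = K @ src
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], -1).view(B, H, W, 2)
        recon = F.grid_sample(source, grid, align_corners=True)
        return (recon - target).abs().mean()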
2
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations.
3
MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
We present MorphNet, an approach to automate the design of neural network structures. MorphNet iteratively shrinks and expands a network, shrinking via a resource-weighted sparsifying regularizer on activations and expanding via a uniform multiplicative factor on all layers.
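A minimal sketch of the two phases as described, under simplifying assumptions: the shrink step is read as an L1 penalty on BatchNorm scale factors weighted by a precomputed per-layer resource cost (e.g. FLOPs per channel), and the expand step as a uniform width multiplier. Helper names and the default strength are illustrative, not the paper's exact formulation.

    import torch.nn as nn

    def morphnet_regularizer(model, layer_costs, strength=1e-8):
        # L1 on BatchNorm scales, weighted by each layer's resource cost;
        # channels whose scale is driven to zero can be pruned (shrink).
        reg = 0.0
        for name, module in model.named_modules():
            if isinstance(module, nn.BatchNorm2d):
                cost = layer_costs.get(name, 1.0)  # assumed precomputed
                reg = reg + cost * module.weight.abs().sum()
        return strength * reg

    def expand_widths(widths, factor):
        # expand: scale all surviving layer widths uniformly
        return [max(1, int(round(w * factor))) for w in widths]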
4
Learning Transferable Architectures for Scalable Image Recognition
In our experiments, we search for the best convolutional layer (or "cell") on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of it, each with their own parameters, to design a convolutional architecture named the "NASNet architecture". For instance, a small version of NASNet achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized state-of-the-art models for mobile platforms.
5
The iNaturalist Species Classification and Detection Dataset
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. To encourage further progress in challenging real world conditions we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals.
6
MegDet: A Large Mini-Batch Object Detector
The improvements in recent CNN-based object detection works, from R-CNN [11] and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24], mainly come from new network architectures, new frameworks, or novel loss designs. But mini-batch size, a key factor in training, has not been well studied.
7
SplineCNN: Fast Geometric Deep Learning with Continuous B-Spline Kernels
We present Spline-based Convolutional Neural Networks (SplineCNNs), a variant of deep neural networks for irregular structured and geometric input, e.g., graphs or meshes. Our main contribution is a novel convolution operator based on B-splines, that makes the computation time independent from the kernel size due to the local support property of the B-spline basis functions.
8
Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks
Rotation-invariant face detection, i.e. detecting faces with arbitrary rotation-in-plane (RIP) angles, is widely required in unconstrained applications but remains a challenging task due to the large variations in face appearance. To address this problem more efficiently, we propose Progressive Calibration Networks (PCN) to perform rotation-invariant face detection in a coarse-to-fine manner.
9
Relation Networks for Object Detection
Although it has long been believed that modeling relations between objects would help object recognition, there has been no evidence that the idea works in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning.
10
CondenseNet: An Efficient DenseNet using Learned Group Convolutions
Deep neural networks are increasingly used on mobile devices, where computational resources are limited. CondenseNet combines dense connectivity with a novel module called learned group convolution.
11
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics.
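The sketch below shows the basic recipe in simplified form: compare unit-normalized activations from a few VGG-16 layers and average the squared differences. The layer choice is illustrative, and the paper's learned per-channel weights are omitted.

    import torch.nn.functional as F
    from torchvision.models import vgg16, VGG16_Weights

    # downloads ImageNet weights; inputs are ImageNet-normalized (B,3,H,W)
    _vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
    _LAYERS = {3, 8, 15, 22}  # relu1_2, relu2_2, relu3_3, relu4_3

    def perceptual_distance(x, y):
        d = 0.0
        for i, layer in enumerate(_vgg):
            x, y = layer(x), layer(y)
            if i in _LAYERS:
                # unit-normalize each spatial feature vector across channels
                d = d + (F.normalize(x, dim=1) - F.normalize(y, dim=1)).pow(2).mean()
            if i == max(_LAYERS):
                break
        return d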
12
A Closer Look at Spatiotemporal Convolutions for Action Recognition
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition.
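One factorization studied in this line of work replaces a full k x k x k spatiotemporal convolution with a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one. A minimal PyTorch reading, where the mid-channel width and normalization choices are assumptions:

    import torch.nn as nn

    class Conv2Plus1D(nn.Sequential):
        # spatial 2D conv, nonlinearity, then temporal 1D conv
        def __init__(self, in_ch, mid_ch, out_ch, k=3):
            p = k // 2
            super().__init__(
                nn.Conv3d(in_ch, mid_ch, (1, k, k), padding=(0, p, p), bias=False),
                nn.BatchNorm3d(mid_ch),
                nn.ReLU(inplace=True),
                nn.Conv3d(mid_ch, out_ch, (k, 1, 1), padding=(p, 0, 0), bias=False),
            )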
13
SBNet: Sparse Blocks Network for Fast Inference
Conventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers - this incurs a high computational cost for real-time applications. For many problems such as object detection and semantic segmentation, we are able to obtain a low-cost computation mask, either from a priori problem knowledge, or from a low-resolution segmentation network.
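A toy gather/scatter version of the idea, assuming the low-cost mask is given, the convolution is stride-1 and 'same'-padded, and H and W are divisible by the block size; real implementations fuse this into custom kernels and handle tile halos, which this sketch ignores.

    import torch.nn.functional as F

    def sparse_block_conv(x, mask, conv, block=16):
        # x: (B,C,H,W); mask: (B,1,H,W); conv: stride-1, 'same'-padded
        B, C, H, W = x.shape
        out = x.new_zeros(B, conv.out_channels, H, W)
        # reduce the mask to a coarse grid of active tiles
        grid = F.max_pool2d(mask.float(), block)
        for b, _, i, j in grid.nonzero().tolist():
            ys, xs = i * block, j * block
            tile = x[b:b + 1, :, ys:ys + block, xs:xs + block]
            out[b:b + 1, :, ys:ys + block, xs:xs + block] = conv(tile)
        return out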
14
LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation
FlowNet2, the state-of-the-art convolutional neural network (CNN) for optical flow estimation, requires over 160M parameters to achieve accurate flow estimation. In this paper we present an alternative network that outperforms FlowNet2 on the challenging Sintel final pass and KITTI benchmarks, while being 30 times smaller in model size and 1.36 times faster at run time.
15
Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform
In this paper, we show that it is possible to recover textures faithful to semantic classes. In particular, we only need to modulate features of a few intermediate layers in a single network conditioned on semantic segmentation probability maps.
16
SPLATNet: Sparse Lattice Networks for Point Cloud Processing
We present a network architecture for processing point clouds that directly operates on a collection of points represented as a sparse set of samples in a high-dimensional lattice. Naively applying convolutions on this lattice scales poorly, both in terms of memory and computational cost, as the size of the lattice increases.
17
An Unsupervised Learning Model for Deformable Medical Image Registration
Current registration methods optimize an objective function independently for each pair of images, which can be time-consuming for large data. We define registration as a parametric function, and optimize its parameters given a set of images from a collection of interest.
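A minimal 2D sketch of this parametric view: one network maps an image pair to a dense displacement field, a differentiable warp applies it, and training minimizes image similarity plus field smoothness over a whole collection. The architecture and loss weights are placeholders, not the paper's.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RegistrationNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 2, 3, padding=1))  # (dx, dy) displacement field

        def forward(self, moving, fixed):
            flow = self.net(torch.cat([moving, fixed], dim=1))
            B, _, H, W = moving.shape
            ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                    torch.linspace(-1, 1, W), indexing="ij")
            base = torch.stack([xs, ys], -1).to(moving).expand(B, H, W, 2)
            warped = F.grid_sample(moving, base + flow.permute(0, 2, 3, 1),
                                   align_corners=True)
            return warped, flow

    def registration_loss(warped, fixed, flow, lam=0.1):
        # image similarity plus a total-variation smoothness penalty
        smooth = (flow[..., 1:] - flow[..., :-1]).abs().mean() \
               + (flow[:, :, 1:] - flow[:, :, :-1]).abs().mean()
        return F.mse_loss(warped, fixed) + lam * smooth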
18
Unsupervised Training for 3D Morphable Model Regression
The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.
19
PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image
While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image.
20
Learning a Single Convolutional Super-Resolution Network for Multiple Degradations
Recent years have witnessed the unprecedented success of deep convolutional neural networks (CNNs) in single image super-resolution (SISR). However, existing CNN-based SISR methods mostly assume that a low-resolution (LR) image is bicubically downsampled from a high-resolution (HR) image, and thus inevitably perform poorly when the true degradation does not follow this assumption.
21
Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer
In this paper, we present a novel method to generate synthetic human part segmentation data using easily-obtained human keypoint annotations. Using these estimated results as additional training data, our semi-supervised model outperforms its strongly-supervised counterpart by 6 mIoU on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human parsing results.
22
Real-Time Seamless Single Shot 6D Object Pose Prediction
For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent CNN-based approaches when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of the existing methods, but at 10 fps or less, they are much slower than our method.
23
Appearance-and-Relation Networks for Video Classification
Spatiotemporal feature learning in videos is a fundamental problem in computer vision. The proposed SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.
24
Look at Boundary: A Boundary-Aware Face Alignment Algorithm
We present a novel boundary-aware face alignment algorithm by utilising boundary lines as the geometric structure of a human face to help facial landmark localisation. By utilising boundary information of 300-W dataset, our method achieves 3.92% mean error with 0.39% failure rate on COFW dataset, and 1.25% mean error on AFLW-Full dataset.
25
Two-Stream Convolutional Networks for Dynamic Texture Synthesis
We introduce a two-stream model for dynamic texture synthesis. Given an input dynamic texture, statistics of filter responses from the object recognition ConvNet encapsulate the per-frame appearance of the input texture, while statistics of filter responses from the optical flow ConvNet model its dynamics.
26
Nonlinear 3D Face Morphable Model
As a classic statistical model of 3D facial shape and texture, 3D Morphable Model (3DMM) is widely used in facial analysis, e.g., model fitting, image synthesis. Conventional 3DMM is learned from a set of well-controlled 2D face images with associated 3D face scans, and represented by two sets of PCA basis functions.
27
PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting. Inspired by network pruning techniques, we exploit redundancies in large deep networks to free up parameters that can then be employed to learn new tasks.
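A sketch of the core pruning step under simple assumptions (magnitude ranking, a fixed keep fraction): weights released from the current task are zeroed and become free capacity for the next task, while the kept weights are frozen thereafter. Names are illustrative.

    import torch

    def prune_for_new_task(weight, old_mask, keep_frac=0.5):
        # among weights owned by the current task, keep the largest
        # keep_frac by magnitude; release the rest for future tasks
        owned = weight[old_mask].abs()
        k = int(keep_frac * owned.numel())
        threshold = owned.topk(k).values.min() if k > 0 else float("inf")
        new_mask = old_mask & (weight.abs() >= threshold)
        weight.data[old_mask & ~new_mask] = 0.0  # freed weights restart at zero
        return new_mask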
28
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings -- the Room-to-Room (R2R) dataset.
29
Deep Ordinal Regression Network for Monocular Depth Estimation
These methods model depth estimation as a regression problem and train the regression networks by minimizing mean squared error, which suffers from slow convergence and unsatisfactory local solutions. By training the network with an ordinal regression loss instead, our method achieves much higher accuracy and faster convergence.
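A sketch of the ordinal view of depth with an assumed depth range and a simplified loss: discretize depth with spacing-increasing thresholds, then train K per-pixel binary decisions instead of one regression output.

    import torch
    import torch.nn.functional as F

    def sid_thresholds(K, d_min=0.25, d_max=80.0):
        # spacing-increasing discretization: thresholds uniform in log depth,
        # so bins widen with distance (range defaults are illustrative)
        i = torch.arange(1, K + 1, dtype=torch.float32)
        return d_min * (d_max / d_min) ** (i / K)

    def ordinal_loss(logits, depth, t):
        # logits: (B,K,H,W) scores for "is depth greater than t_k?";
        # depth: (B,1,H,W) ground truth; t: (K,) thresholds
        target = (depth > t.view(1, -1, 1, 1).to(depth)).float()
        return F.binary_cross_entropy_with_logits(logits, target)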
30
PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition
This is largely due to the difficulty in extracting local feature descriptors from a point cloud that can subsequently be encoded into a global descriptor for the retrieval task. We create benchmark datasets for point cloud based retrieval for place recognition, and the experimental results on these datasets show the feasibility of our PointNetVLAD.
31
Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN
We have shown that an IndRNN can be easily regulated to prevent the gradient exploding and vanishing problems while allowing the network to learn long-term dependencies. Experimental results have shown that the proposed IndRNN is able to process very long sequences (over 5000 time steps), can be used to construct very deep networks (21 layers used in the experiment) and still be trained robustly.
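A minimal cell capturing the stated idea: the recurrent connection is an element-wise (per-neuron) weight rather than a full matrix, and clamping its magnitude regulates gradient growth over long sequences. Hyperparameters are illustrative.

    import torch
    import torch.nn as nn

    class IndRNNCell(nn.Module):
        # h_t = relu(W x_t + u * h_{t-1} + b), with u a per-unit scalar
        def __init__(self, input_size, hidden_size, u_max=1.0):
            super().__init__()
            self.inp = nn.Linear(input_size, hidden_size)
            self.u = nn.Parameter(torch.empty(hidden_size).uniform_(0, u_max))
            self.u_max = u_max

        def forward(self, x_seq, h=None):
            # x_seq: (T, B, input_size)
            T, B, _ = x_seq.shape
            if h is None:
                h = x_seq.new_zeros(B, self.u.numel())
            u = self.u.clamp(-self.u_max, self.u_max)  # bounds the recurrence
            outs = []
            for t in range(T):
                h = torch.relu(self.inp(x_seq[t]) + u * h)
                outs.append(h)
            return torch.stack(outs), h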
32
Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser
With a high-level representation guided denoiser (HGD) as a defense, the target model is more robust to both white-box and black-box adversarial attacks. Moreover, HGD can be trained on a small subset of the images and generalizes well to other images and unseen classes.
33
Detail-Preserving Pooling in Deep Networks
Most convolutional neural networks use some method for gradually downscaling the size of the hidden layers. This is commonly referred to as pooling, and is applied to reduce the number of parameters, improve invariance to certain distortions, and increase the receptive field size.
34
Decorrelated Batch Normalization
Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not just centers and scales activations but whitens them.
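A sketch of the whitening step on a flattened (N, D) batch of activations, using ZCA (scale by the inverse square root of the covariance in the eigenbasis, then rotate back); the paper's grouping and running statistics are omitted.

    import torch

    def zca_whiten(x, eps=1e-5):
        # x: (N, D) activations; returns centered, whitened activations
        xc = x - x.mean(dim=0, keepdim=True)
        cov = xc.t() @ xc / (x.shape[0] - 1)
        eigval, eigvec = torch.linalg.eigh(cov)  # symmetric eigendecomposition
        inv_sqrt = eigvec @ torch.diag((eigval + eps).rsqrt()) @ eigvec.t()
        return xc @ inv_sqrt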
35
Unsupervised Discovery of Object Landmarks as Structural Representations
Deep neural networks can model images with rich latent representations, but they cannot naturally conceptualize structures of object categories in a human-perceptible way. This paper addresses the problem of learning object structures in an image modeling process without supervision.
36
Learning a Discriminative Feature Network for Semantic Segmentation
Most existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction. To tackle these two problems, we propose a Discriminative Feature Network (DFN), which contains two sub-networks: Smooth Network and Border Network.
37
Discriminability objective for training descriptive captions
One property that remains lacking in image captions generated by contemporary methods is discriminability: being able to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation.
38
Conditional Probability Models for Deep Image Compression
Deep Neural Networks trained as image auto-encoders have recently emerged as a promising direction for advancing the state-of-the-art in image compression. During training, the auto-encoder makes use of the context model to estimate the entropy of its representation, and the context model is concurrently updated to learn the dependencies between the symbols in the latent representation.
39
IQA: Visual Question Answering in Interactive Environments
The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1.
40
Surface Networks
We study data-driven representations for three-dimensional triangle meshes, which are among the most prevalent representations of 3D geometry. Recent works have developed models that exploit the intrinsic geometry of manifolds and graphs, namely Graph Neural Networks (GNNs) and their spectral variants, which learn from the local metric tensor via the Laplacian operator.
41
Actor and Observer: Joint Modeling of First and Third-Person Videos
Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data.
42
Attentive Generative Adversarial Network for Raindrop Removal from a Single Image
In this paper, we address the problem by visually removing raindrops, and thus transforming a raindrop degraded image into a clean one. This injection of visual attention to both generative and discriminative networks is the main contribution of this paper.
43
Fast and Accurate Online Video Object Segmentation via Tracking Parts
Online video object segmentation is a challenging task, as it requires processing the image sequence both quickly and accurately. To segment a target object through the video, numerous CNN-based methods have been developed by heavily fine-tuning on the object mask in the first frame, which is time-consuming for online applications.
44
Optimizing Video Object Detection via a Scale-Time Lattice
High-performance object detection relies on expensive convolutional networks to compute features, often leading to significant challenges in applications, e.g. those that require detecting objects from video streams in real time. The key to this problem is to trade accuracy for efficiency in an effective way, i.e. reducing the computing cost while maintaining competitive performance.
45
Adversarially Learned One-Class Classifier for Novelty Detection
Novelty detection is the process of identifying the observation(s) that differ in some respect from the training observations (the target class). Our architecture is composed of two deep networks, each trained by competing with the other while collaborating to understand the underlying concept in the target class, and then used to classify the test samples.
46
Resource Aware Person Re-identification across Multiple Resolutions
Not all people are equally easy to identify: color statistics might be enough for some cases, while others might require careful reasoning about high- and low-level details. However, prevailing person re-identification (re-ID) methods use one-size-fits-all high-level embeddings from deep convolutional networks for all cases.
47
Revisiting Video Saliency: A Large-scale Benchmark and a New Model
Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. We therefore propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning.
48
First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations
In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.
49
Functional Map of the World
Our dataset consists of over 1 million images from over 200 countries. We present an analysis of the dataset along with baseline approaches that reason about metadata and temporal views.
50
CBMV: A Coalesced Bidirectional Matching Volume for Disparity Estimation
Recently, there has been a paradigm shift in stereo matching with learning-based methods achieving the best results on all popular benchmarks. The success of these methods is due to the availability of training data with ground truth; training learning-based systems on these datasets has allowed them to surpass the accuracy of conventional approaches based on heuristics and assumptions.
51
SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text
With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images.
52
TextureGAN: Controlling Deep Image Synthesis with Texture Patches
In this paper, we investigate deep image synthesis guided by sketch, color, and texture. Previous image synthesis methods can be controlled by sketch and color strokes but we are the first to examine texture control.
53
Who Let The Dogs Out? Modeling Dog Behavior From Visual Data
We introduce the task of directly modeling a visually intelligent agent. Our model takes visual information as input and directly predicts the actions of the agent.
54
CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes
We demonstrate CSRNet on four datasets (the ShanghaiTech, UCF_CC_50, WorldEXPO'10, and UCSD datasets) and deliver state-of-the-art performance. On the ShanghaiTech Part_B dataset, CSRNet achieves a 47.3% lower mean absolute error (MAE) than the previous state-of-the-art method.
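A sketch of the architectural idea: after a truncated VGG-style front end, grow the receptive field with dilated 3x3 convolutions instead of further pooling, ending in a 1x1 density-map head. The channel widths follow one common reading of the paper's back end and should be treated as assumptions.

    import torch.nn as nn

    def dilated_backend(in_ch=512, dilation=2):
        layers, ch = [], in_ch
        for out_ch in (512, 512, 512, 256, 128, 64):
            layers += [nn.Conv2d(ch, out_ch, 3, padding=dilation, dilation=dilation),
                       nn.ReLU(inplace=True)]
            ch = out_ch
        layers.append(nn.Conv2d(ch, 1, 1))  # predicted crowd density map
        return nn.Sequential(*layers)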
55
Recurrent Saliency Transformation Network: Incorporating Multi-Stage Visual Cues for Small Organ Segmentation
Missing contextual information leads to unsatisfying convergence over iterations, and the fine stage sometimes produces even lower segmentation accuracy than the coarse stage. The key innovation is a saliency transformation module, which repeatedly converts the segmentation probability map from the previous iteration into spatial weights and applies these weights to the current iteration.
56
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at face value, learning the appearances---objects, scenes, and body poses---that distinguish each action class. We also show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition.
57
Learning Time/Memory-Efficient Deep Architectures with Budgeted Super Networks
We propose to focus on the problem of discovering neural network architectures efficient in terms of both prediction quality and cost. Our contribution is a novel family of models called Budgeted Super Networks (BSN).
58
EPINET: A Fully-Convolutional Neural Network Using Epipolar Geometry for Depth from Light Field Images
Light field cameras capture both the spatial and the angular properties of light rays in space. Thanks to this property, one can compute depth from light fields even in uncontrolled lighting environments, which is a big advantage over active sensing devices.
59
Convolutional Sequence to Sequence Model for Human Dynamics
Human motion modeling is a classic problem in computer vision and graphics. Challenges in modeling human motion include high-dimensional prediction as well as extremely complicated dynamics. We present a novel approach to human motion modeling based on convolutional neural networks (CNNs).
60
Correlation Tracking via Joint Discrimination and Reliability Learning
To address this issue, we propose a novel CF-based optimization problem to jointly model discrimination and reliability information. The base filter aims to learn the discrimination information between the target and the background, while the reliability term encourages the final filter to focus on more reliable regions.
61
Mix and match networks: encoder-decoder alignment for zero-pair image translation
We address the problem of image translation between domains or modalities for which no direct paired data is available (i.e. zero-pair translation). We show our approach is scalable and can perform colorization and style transfer between unseen combinations of domains.
62
Deep Learning under Privileged Information Using Heteroscedastic Dropout
This is what the Learning Under Privileged Information (LUPI) paradigm endeavors to model by utilizing extra knowledge that is only available during training. Our method significantly increases sample efficiency during learning, resulting in much higher accuracy when the number of training examples is limited.
63
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.
64
Deep Image Prior
In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.
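The recipe reduces to a few lines: fix a random input code, fit a randomly initialized network to the single corrupted image, and stop early, since natural image structure is fit faster than noise. The tiny network and step count below are stand-ins for the paper's hourglass architecture.

    import torch
    import torch.nn as nn

    def deep_image_prior(noisy, steps=1800, lr=1e-2):
        # noisy: (1,3,H,W); the only "training data" is the corrupted image
        net = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))
        z = torch.randn(1, 32, *noisy.shape[-2:])  # fixed random input code
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):  # early stopping acts as the regularizer
            opt.zero_grad()
            loss = ((net(z) - noisy) ** 2).mean()
            loss.backward()
            opt.step()
        return net(z).detach()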
65
Latent RANSAC
We present a method that can evaluate a RANSAC hypothesis in constant time, i.e. independent of the size of the data. A key observation here is that correct hypotheses are tightly clustered together in the latent parameter domain.
66
Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task.
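A minimal sketch of the weighting scheme: each task gets a learned log-variance s_i and the combined loss sums exp(-s_i) * L_i + s_i, so tasks the model is uncertain about are automatically down-weighted. The paper derives slightly different constants for regression and classification losses; this sketch drops them.

    import torch
    import torch.nn as nn

    class UncertaintyWeighting(nn.Module):
        def __init__(self, num_tasks):
            super().__init__()
            self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_i = log sigma_i^2

        def forward(self, losses):
            total = 0.0
            for loss, s in zip(losses, self.log_vars):
                total = total + torch.exp(-s) * loss + s
            return total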
67
Adversarial Complementary Learning for Weakly Supervised Object Localization
Although it is usually responsive only to sparse parts of the target objects, this classifier can drive a counterpart classifier to discover new and complementary object regions by erasing its discovered regions from the feature maps. Through such adversarial learning, the two parallel classifiers are forced to leverage complementary object regions for classification and can finally generate integral object localization together.
68
Learning to Extract a Video Sequence from a Single Motion-Blurred Image
We present a method to extract a video sequence from a single motion-blurred image. One key difficulty is that averaging destroys the temporal ordering of the frames.
69
Image Super-Resolution via Dual-State Recurrent Networks
Advances in image super-resolution (SR) have recently benefited significantly from rapid developments in deep neural networks. Inspired by these recent discoveries, we note that many state-of-the-art deep SR architectures can be reformulated as a single-state recurrent neural network (RNN) with finite unfoldings.
70
Structure from Recurrent Motion: From Rigidity to Recurrency
This paper proposes a new method for Non-Rigid Structure-from-Motion (NRSfM) from a long monocular video sequence observing a non-rigid object performing recurrent and possibly repetitive dynamic actions. Departing from the traditional idea of using low-order linear or low-rank shape models for NRSfM, our method exploits the property of shape recurrency (i.e., many deforming shapes tend to repeat themselves over time).
71
Squeeze-and-Excitation Networks
In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%.