CVPR 2016

The most popular implementations from this conference
1
Convolutional Pose Machines
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation.
2
Rethinking the Inception Architecture for Computer Vision
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains in various benchmarks.
3
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
4
Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis
This paper studies a combination of generative Markov random field (MRF) models and discriminatively trained deep convolutional neural networks (dCNNs) for synthesizing 2D images. The generative MRF acts on higher levels of a dCNN feature pyramid, controlling the image layout at an abstract level.
5
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
6
Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis
This paper studies a combination of generative Markov random field (MRF) models and discriminatively trained deep convolutional neural networks (dCNNs) for synthesizing 2D images. The generative MRF acts on higher levels of a dCNN feature pyramid, controlling the image layout at an abstract level.
7
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
8
Convolutional Pose Machines
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation.
9
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
10
Learning Deep Representations of Fine-grained Visual Descriptions
State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories.
11
Real-time Action Recognition with Enhanced Motion Vector CNNs
The deep two-stream architecture exhibited excellent performance on video-based action recognition. The most computationally expensive step in this approach is the calculation of optical flow, which prevents it from running in real time.
12
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
13
Instance-aware Semantic Segmentation via Multi-task Network Cascades
In this paper, we present Multi-task Network Cascades for instance-aware semantic segmentation. We develop an algorithm for the nontrivial end-to-end training of this causal, cascaded structure.
14
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state.
15
DeepFool: a simple and accurate method to fool deep neural networks
State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well-sought perturbations of the images.
16
Neural Module Networks
Visual question answering is fundamentally compositional in nature: a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions.
17
Convolutional Two-Stream Network Fusion for Video Action Recognition
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information.
18
Training Region-based Object Detectors with Online Hard Example Mining
Our motivation is the same as it has always been: detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012, respectively.
19
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
20
Learning Deep Representations of Fine-grained Visual Descriptions
State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories.
21
Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification
Learning generic and robust feature representations with data from multiple domains for the same problem is of great value, especially for the problems that have multiple datasets but none of them are large enough to provide abundant data variations. In this work, we present a pipeline for learning deep feature representations from multiple domains with Convolutional Neural Networks (CNNs).
22
Quantized Convolutional Neural Networks for Mobile Devices
Recently, convolutional neural networks (CNN) have demonstrated impressive performance in various computer vision tasks. However, high performance hardware is typically indispensable for the application of CNN models due to the high computation complexity, which prohibits their further extensions.
23
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
24
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
25
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
26
Joint Unsupervised Learning of Deep Representations and Image Clusters
In this paper, we propose a recurrent framework for Joint Unsupervised LEarning (JULE) of deep representations and image clusters. In our framework, successive operations in a clustering algorithm are expressed as steps in a recurrent process, stacked on top of representations output by a Convolutional Neural Network (CNN).
27
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes on the learned classification network to localize each action instance.
28
LocNet: Improving Localization Accuracy for Object Detection
We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems. Our model, given a search region, aims at returning the bounding box of an object of interest inside this region.
29
Volumetric and Multi-View CNNs for Object Classification on 3D Data
Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs.
30
Shallow and Deep Convolutional Networks for Saliency Prediction
The prediction of salient areas in images has been traditionally addressed with hand-crafted features based on neuroscience principles. This paper, however, addresses the problem with a completely data-driven approach by training a convolutional neural network (convnet).
31
Context Encoders: Feature Learning by Inpainting
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
32
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
This paper considers the task of articulated human pose estimation of multiple people in real world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.
33
DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
This paper considers the task of articulated human pose estimation of multiple people in real world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.
34
Compact Bilinear Pooling
Bilinear models have been shown to achieve impressive performance on a wide range of visual tasks, such as semantic segmentation, fine-grained recognition, and face recognition. However, bilinear features are high dimensional, typically on the order of hundreds of thousands to a few million, which makes them impractical for subsequent analysis.
35
TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks
In this paper we present a deep neural network topology that incorporates a simple-to-implement transformation-invariant pooling operator (TI-POOLING). This more efficient use of training data results in better performance on popular benchmark datasets with a smaller number of parameters compared to standard convolutional neural networks with dataset augmentation and to other baselines.
36
Generation and Comprehension of Unambiguous Object Descriptions
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.
37
Weakly Supervised Deep Detection Networks
Weakly supervised learning of object detection is an important problem in image understanding that still does not have a satisfactory solution. In this paper, we address this problem by exploiting the power of deep convolutional neural networks pre-trained on large-scale image-level classification tasks.
38
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
39
Stacked Attention Networks for Image Question Answering
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.
40
TGIF: A New Dataset and Benchmark on Animated GIF Description
With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips.
41
Compact Bilinear Pooling
Bilinear models have been shown to achieve impressive performance on a wide range of visual tasks, such as semantic segmentation, fine-grained recognition, and face recognition. However, bilinear features are high dimensional, typically on the order of hundreds of thousands to a few million, which makes them impractical for subsequent analysis.
42
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context. Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.
43
Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation
Deep neural networks with alternating convolutional, max-pooling and decimation layers are widely used in state-of-the-art architectures for computer vision. Max-pooling purposefully discards precise spatial information in order to create features that are more robust and are typically organized as lower-resolution spatial feature maps.
44
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
45
Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions
Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, we also demonstrate that a combination of the triplet and global losses produces the best embedding in the field, using this triplet network.
46
Structured Receptive Fields in CNNs
We combine these ideas into structured receptive field networks, a model which has a fixed filter basis and yet retains the flexibility of CNNs. As a realistic small dataset example, we show state-of-the-art classification results on popular 3D MRI brain-disease datasets where pre-training is difficult due to a lack of large public datasets in a similar domain.
47
NetVLAD: CNN architecture for weakly supervised place recognition
We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions.
48
Adaptive Object Detection Using Adjacency and Zoom Prediction
State-of-the-art object detection systems rely on an accurate set of region proposals. Compared to methods based on fixed anchor locations, our approach naturally adapts to cases where object instances are sparse and small.
49
Deep Saliency with Encoded Low level Distance Map and High Level Features
Recent advances in saliency detection have utilized deep learning to obtain high level features to detect salient regions in a scene. The high level features are extracted using the VGG-net, and the low level features are compared with other parts of an image to form a low level distance map.
50
Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions
Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, we also demonstrate that a combination of the triplet and global losses produces the best embedding in the field, using this triplet network.
51
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
52
MovieQA: Understanding Stories in Movies through Question-Answering
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity.
53
Loss Functions for Top-k Error: Analysis and Insights
In order to push the performance on realistic computer vision tasks, the number of classes in modern benchmark datasets has significantly increased in recent years. In the experiments, we compare on various datasets all of the proposed and established methods for top-k error optimization.
54
Compact Bilinear Pooling
Bilinear models have been shown to achieve impressive performance on a wide range of visual tasks, such as semantic segmentation, fine-grained recognition, and face recognition. However, bilinear features are high dimensional, typically on the order of hundreds of thousands to a few million, which makes them impractical for subsequent analysis.
55
Rethinking the Inception Architecture for Computer Vision
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains in various benchmarks.
56
Bottom-Up and Top-Down Reasoning with Hierarchical Rectified Gaussians
We show that RGs can be optimized with a quadratic program (QP), that can in turn be optimized with a recurrent neural network (with rectified linear units). From a practical perspective, RGs are well suited for detailed spatial tasks that can benefit from top-down reasoning.
57
Stacked Attention Networks for Image Question Answering
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.
58
Training Region-based Object Detectors with Online Hard Example Mining
Our motivation is the same as it has always been: detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012, respectively.
59
Object Contour Detection with a Fully Convolutional Encoder-Decoder Network
We develop a deep learning algorithm for contour detection with a fully convolutional encoder-decoder network. Different from previous low-level edge detection, our algorithm focuses on detecting higher-level object contours.
60
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
61
Deep Saliency with Encoded Low level Distance Map and High Level Features
Recent advances in saliency detection have utilized deep learning to obtain high level features to detect salient regions in a scene. The high level features are extracted using the VGG-net, and the low level features are compared with other parts of an image to form a low level distance map.
62
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
63
DeepFool: a simple and accurate method to fool deep neural networks
State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well-sought perturbations of the images.
64
Learning Deep Representations of Fine-grained Visual Descriptions
State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories.
65
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
66
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
67
Convolutional Pose Machines
Pose Machines provide a sequential prediction framework for learning rich implicit spatial models. In this work we show a systematic design for how convolutional networks can be incorporated into the pose machine framework for learning image features and image-dependent spatial models for the task of pose estimation.
68
Staple: Complementary Learners for Real-Time Tracking
Correlation Filter-based trackers have recently achieved excellent performance, showing great robustness to challenging situations exhibiting motion blur and illumination changes. However, since the model that they learn depends strongly on the spatial layout of the tracked object, they are notoriously sensitive to deformation.
69
Training Region-based Object Detectors with Online Hard Example Mining
Our motivation is the same as it has always been: detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012, respectively.
70
Eye Tracking for Everyone
We believe that we can put the power of eye tracking in everyone's palm by building eye tracking software that works on commodity hardware such as mobile phones and tablets, without the need for additional sensors or devices. Our model achieves a prediction error of 1.71 cm and 2.53 cm without calibration on mobile phones and tablets, respectively.
71
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
72
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
73
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
74
CNN-RNN: A Unified Framework for Multi-label Image Classification
While deep convolutional neural networks (CNNs) have shown great success in single-label image classification, it is important to note that real-world images generally contain multiple labels, which could correspond to different objects, scenes, actions and attributes in an image. Traditional approaches to multi-label image classification learn independent classifiers for each category and employ ranking or thresholding on the classification results.
75
Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis
This paper studies a combination of generative Markov random field (MRF) models and discriminatively trained deep convolutional neural networks (dCNNs) for synthesizing 2D images. The generative MRF acts on higher levels of a dCNN feature pyramid, controlling the image layout at an abstract level.
76
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
77
Context Encoders: Feature Learning by Inpainting
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
78
Context Encoders: Feature Learning by Inpainting
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
79
Context Encoders: Feature Learning by Inpainting
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
80
Context Encoders: Feature Learning by Inpainting
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
81
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
82
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
83
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
This means that the super-resolution (SR) operation is performed in HR space. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space.
84
Learning Deep Representations of Fine-grained Visual Descriptions
State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories.
85
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
86
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
87
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
88
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
89
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
90
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
91
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
92
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
93
You Only Look Once: Unified, Real-Time Object Detection
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
94
Inverting Visual Representations with Convolutional Networks
Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network. Most strikingly, the colors and the rough contours of an image can be reconstructed from activations in higher network layers and even from the predicted class probabilities.
95
Stacked Attention Networks for Image Question Answering
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. Because image question answering often requires multiple steps of reasoning, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively.
96
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state.
97
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
98
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
99
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.
100
Learning Deep Features for Discriminative Localization
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks.