The standard paradigm is to use relationships in the input graph to transfer information from training to testing nodes via GCNs; examples include the semi-supervised, zero-shot, and few-shot learning setups.
In this paper, we introduce certainty-aware pseudo labels tailored for object detection, which can effectively estimate the classification and localization quality of derived pseudo labels.
We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision.
Many variants of adversarial training have been proposed, with most research focusing on problems with relatively few classes.
The resulting algorithm is referred to as AutoFocus and results in a 2.5-5 times speed-up during inference when used with SNIP.
This paper studies video inpainting detection, which localizes an inpainted region in a video both spatially and temporally.
We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents and 3D objects.
Then, only frames and convolutions that are selected by the selection network are used in the 3D model to generate predictions.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
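A minimal sketch of this teacher-to-pseudo-label step, keeping only confident predictions; the `teacher` callable, the probability-vector interface, and the 0.9 confidence threshold are illustrative assumptions, not details taken from the paper:

```python
def generate_pseudo_labels(teacher, unlabeled, threshold=0.9):
    # `teacher` maps an example to a list of class probabilities; keep only
    # confident predictions as pseudo labels. The 0.9 threshold is illustrative.
    pseudo = []
    for x in unlabeled:
        probs = teacher(x)
        c = max(range(len(probs)), key=lambda i: probs[i])
        if probs[c] >= threshold:
            pseudo.append((x, c))
    return pseudo
```

The returned (example, label) pairs can then be mixed with the labeled set to train the student.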
Current action recognition systems require large amounts of training data for recognizing an action.
Results show that our framework achieves state-of-the-art performance at 31 FPS and significantly improves our baseline by 9.0% mAP on the nuScenes test set.
In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes.
Object detection is an essential step towards holistic scene understanding.
Given color images and noisy, incomplete target depth maps, we optimize a randomly-initialized CNN to reconstruct a restored depth map, using the network structure itself as a prior combined with a view-constrained photo-consistency loss.
Deep neural networks have been shown to suffer from poor generalization when small perturbations are added (like Gaussian noise), yet little work has been done to evaluate their robustness to more natural image transformations like photo filters.
State-of-the-art object detectors rely on regressing and classifying an extensive list of possible anchors, which are divided into positive and negative samples based on their intersection-over-union (IoU) with corresponding groundtruth objects.
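The IoU-based splitting of anchors into positive and negative samples can be sketched as follows; the 0.7/0.3 thresholds are common RPN-style defaults used purely for illustration, not values from any particular detector:

```python
import numpy as np

def pairwise_iou(anchors, gts):
    # anchors: (A, 4), gts: (G, 4), boxes as (x1, y1, x2, y2) -> (A, G) IoU matrix.
    x1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (a[:, None] + g[None, :] - inter)

def assign_anchors(anchors, gts, pos_thresh=0.7, neg_thresh=0.3):
    # 1 = positive, 0 = negative, -1 = ignored (best IoU between the thresholds).
    best_iou = pairwise_iou(anchors, gts).max(axis=1)
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best_iou >= pos_thresh] = 1
    labels[best_iou < neg_thresh] = 0
    return labels
```

Anchors whose best IoU falls between the two thresholds are typically ignored during training rather than treated as negatives.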
This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios.
Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high value data that can best improve model performance.
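One common instantiation of "high value" selection is uncertainty sampling by predictive entropy; a minimal sketch, where the `model` callable returning class probabilities and the pool format are assumptions rather than the paper's specific acquisition rule:

```python
import math

def entropy(probs):
    # Shannon entropy of a discrete distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(model, pool, k):
    # Uncertainty sampling: send the k pool items the model is least sure
    # about (highest predictive entropy) to the annotator.
    return sorted(pool, key=lambda x: entropy(model(x)), reverse=True)[:k]
```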
Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation.
We propose weakly supervised language localization networks (WSLLN) to detect events in long, untrimmed videos given language queries.
Non-negative matrix factorization (NMF) minimizes the Euclidean distance between the data matrix and its low rank approximation, and it fails when applied to corrupted data because the loss function is sensitive to outliers.
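A minimal sketch of plain Euclidean-loss NMF via the classic Lee-Seung multiplicative updates, whose squared-error objective is exactly the outlier-sensitive loss described above; the iteration count and epsilon are illustrative:

```python
import numpy as np

def nmf(V, k, iters=500, eps=1e-9, seed=0):
    # Lee-Seung multiplicative updates for min ||V - W @ H||_F^2 with W, H >= 0.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # eps guards against division by zero
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because each residual enters the loss squared, a single grossly corrupted entry of `V` can dominate the factorization, which is the failure mode the sentence above refers to.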
Adversarial training, in which a network is trained on adversarial examples, is one of the few defenses against adversarial attacks that withstands strong attacks.
However, neural classifiers are often extremely brittle when confronted with domain shift---changes in the input distribution that occur over time.
We analyze how well their features generalize to tasks like image classification, semantic segmentation, and object detection on small datasets such as PASCAL-VOC, Caltech-256, SUN-397, and Flowers-102.
This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos.
Instead of sequentially distilling knowledge only from the last model, we directly leverage all previous model snapshots.
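One generic way to leverage every snapshot is to average their temperature-softened predictions into a single distillation target; this is a sketch under that assumption, not necessarily the paper's exact formulation, and the temperature `T` is illustrative:

```python
import numpy as np

def snapshot_soft_targets(snapshot_logits, T=2.0):
    # Average temperature-softened class distributions over every model
    # snapshot, rather than distilling from the final snapshot alone.
    targets = []
    for logits in snapshot_logits:
        z = np.exp((logits - logits.max(axis=-1, keepdims=True)) / T)  # stable softmax(logits / T)
        targets.append(z / z.sum(axis=-1, keepdims=True))
    return np.mean(targets, axis=0)
```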
We propose StartNet to address Online Detection of Action Start (ODAS) where action starts and their associated categories are detected in untrimmed, streaming videos.
The latent representations are jointly optimized with the corresponding generation network to condition the synthesis process, encouraging a diverse set of generated results that are visually compatible with existing fashion garments.
We present Temporal Aggregation Network (TAN) which decomposes 3D convolutions into spatial and temporal aggregation blocks.
Instead of processing an entire image pyramid, AutoFocus adopts a coarse to fine approach and only processes regions which are likely to contain small objects at finer scales.
In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network.
We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition.
Standard adversarial attacks change the predicted class label of a selected image by adding specially tailored small perturbations to its pixels.
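The canonical example of such a tailored small perturbation is the fast gradient sign method (FGSM), sketched below; the step size `eps` and the [0, 1] pixel range are illustrative assumptions:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    # Single-step FGSM: shift every pixel by eps along the sign of the
    # loss gradient, then clip back to the valid [0, 1] pixel range.
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

Here `grad` is the gradient of the classification loss with respect to the input pixels, so the perturbation maximally increases the loss under an L-infinity budget of `eps`.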
We show results on CAD120 (which provides pre-computed node features and edge weights for fair performance comparison across algorithms) as well as a more complex real-world activity dataset, Charades.
The advent of image sharing platforms and the easy availability of advanced photo editing software have resulted in large quantities of manipulated images being shared on the internet.
This encourages the network to preserve the geometric structure in Euclidean space throughout the feature extraction hierarchy.
It is of interest to the community to explicitly discover such biases, both for understanding the behavior of such models, and towards debugging them.
Most work on temporal action detection is formulated as an offline problem, in which the start and end times of actions are determined after the entire video is fully observed.
Existing 3D pose datasets of object categories are limited to generic object types and lack fine-grained information.
Video summarization is a challenging under-constrained problem because the underlying summary of a single video strongly depends on users' subjective understandings.
Interestingly, we observe that after dropping 30% of the annotations (and labeling them as background), the performance of CNN-based object detectors like Faster-RCNN only drops by 5% on the PASCAL VOC dataset.
Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second during inference with a single GPU.
Image manipulation detection is different from traditional semantic object detection because it pays more attention to tampering artifacts than to image content, which suggests that richer features need to be learned.
In particular, given an image from the source domain and unlabeled samples from the target domain, the generator synthesizes new images on-the-fly to resemble samples from the target domain in appearance and the segmentation network further refines high-level features before predicting semantic maps, both of which leverage feature statistics of sampled images from the target domain.
We address the recognition of agent-in-place actions, which are associated with agents who perform them and places where they occur, in the context of outdoor home surveillance.
To dramatically speed up relevant motion event detection and improve its performance, we propose a novel network for relevant motion event detection, ReMotENet, which is a unified, end-to-end data-driven method using spatial-temporal attention-based 3D ConvNets to jointly model the appearance and motion of objects-of-interest in a video.
Our approach is a modification of the R-FCN architecture in which position-sensitive filters are shared across different object classes for performing localization.
Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications.
We present an image-based Virtual Try-On Network (VITON) without using 3D information in any form, which seamlessly transfers a desired clothing item onto the corresponding region of a person using a coarse-to-fine strategy.
On the COCO dataset, our single model performance is 45.7% and an ensemble of 3 networks obtains an mAP of 48.3%.
In contrast, we argue that it is essential to prune neurons in the entire neural network jointly based on a unified goal: minimizing the reconstruction error of important responses in the "final response layer" (FRL), which is the second-to-last layer before classification, for a pruned network to retain its predictive power.
We introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high resolution images.
We introduce count-guided weakly supervised localization (C-WSL), an approach that uses per-class object count as a new form of supervision to improve weakly supervised localization (WSL).
Anticipating future actions is a key component of intelligence, specifically when it applies to real-time systems, such as robots or autonomous cars.
For each temporal segment inside a proposal, features are uniformly sampled at a pair of scales and are input to a temporal convolutional neural network for classification.
This paper proposes an automatic spatially-aware concept discovery approach using weakly labeled image-text data from shopping websites.
Understanding visual relationships involves identifying the subject, the object, and a predicate relating them.
To this end, we propose to jointly learn a visual-semantic embedding and the compatibility relationships among fashion items in an end-to-end fashion.
We then build a multi-level deep architecture to exploit the first and second order information within different convolutional layers.
To this end, we propose Soft-NMS, an algorithm which decays the detection scores of all other objects as a continuous function of their overlap with the highest-scoring box M. Hence, no object is eliminated in this process.
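A minimal sketch of this idea, using a Gaussian decay; the `sigma` and final score threshold are illustrative defaults rather than values prescribed by the paper:

```python
import numpy as np

def iou(box, boxes):
    # IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    # Repeatedly pick the highest-scoring box M; instead of deleting its
    # overlapping neighbors, decay their scores continuously with IoU(M, .).
    boxes = boxes.astype(float).copy()
    scores = scores.astype(float).copy()
    kept_boxes, kept_scores = [], []
    while scores.size:
        i = int(np.argmax(scores))
        kept_boxes.append(boxes[i])
        kept_scores.append(scores[i])
        m = boxes[i]
        boxes = np.delete(boxes, i, axis=0)
        scores = np.delete(scores, i)
        if scores.size:
            scores = scores * np.exp(-iou(m, boxes) ** 2 / sigma)  # Gaussian decay
            keep = scores > score_thresh
            boxes, scores = boxes[keep], scores[keep]
    return np.array(kept_boxes), np.array(kept_scores)
```

Unlike hard NMS, a heavily overlapping detection survives with a reduced score instead of being suppressed outright, which helps when two true objects genuinely overlap.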
In particular, we propose spatial context networks that learn to predict a representation of one image patch from another image patch, within the same image, conditioned on their real-valued relative spatial offset.
We present a Deep Convolutional Neural Network architecture which serves as a generic image-to-image regressor that can be trained end-to-end without any further machinery.
Even with the recent advances in convolutional neural networks (CNN) in various visual recognition tasks, the state-of-the-art action recognition system still relies on hand-crafted motion features such as optical flow to achieve the best performance.
Compared to earlier multistage frameworks using CNN features, recent end-to-end deep approaches for fine-grained recognition essentially enhance the mid-level learning capability of CNNs.
Since interactions between objects can be reduced to a limited set of atomic spatial relations in 3D, we study the possibility of inferring 3D structure from a text description rather than an image, applying physical relation models to synthesize holistic 3D abstract object layouts satisfying the spatial constraints present in a textual description.
The deep learning modeling lifecycle generates a rich set of data artifacts, such as learned parameters and training logs, and comprises several frequently conducted tasks, e.g., understanding model behaviors and trying out new models.
A single shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions.
We investigate the reasons why context in object detection has limited utility by isolating and evaluating the predictive power of different context cues under ideal conditions in which context is provided by an oracle.
Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region.
Fine-grained classification involves distinguishing between similar sub-categories based on subtle differences in highly localized regions; therefore, accurate localization of discriminative regions remains a major challenge.
In the first stage of classification, binary codes are considered as class labels by a set of binary SVMs; each corresponds to one bit.
Perceiving meaningful activities in a long video sequence is a challenging problem due to ambiguous definition of 'meaningfulness' as well as clutters in the scene.
Based on the multi-scale nature of objects in images, our approach is built on top of a hierarchical segmentation.
Sparse representations have been successfully applied to signal processing, computer vision and machine learning.
Action recognition tasks usually rely on complex hand-crafted structures as features to represent the human action model.
VRFP is a real-time video retrieval framework based on short text input queries, which obtains weakly labeled training images from the web after the query is known.
Since attributes are generally correlated, we introduce a low rank attribute embedding into the MTL formulation to embed original binary attributes to a continuous attribute space, where incorrect and incomplete attributes are rectified and recovered to better describe people.
Given a text description of an event, event retrieval is performed by selecting concepts linguistically related to the event description and fusing the concept responses on unseen videos.
Many existing recognition algorithms combine different modalities based on training accuracy but do not consider the possibility of noise at test time.
We present a supervised binary encoding scheme for image retrieval that learns projections by taking into account similarity between classes obtained from output embeddings.
We discuss methodological issues related to the evaluation of unsupervised binary code construction methods for nearest neighbor search.
To perform unconstrained face recognition robust to variations in illumination, pose and expression, this paper presents a new scheme to extract "Multi-Directional Multi-Level Dual-Cross Patterns" (MDML-DCPs) from face images.
We propose a method to expand the visual coverage of training sets that consist of a small number of labeled examples using learned attributes.
Using an analogous reasoning, we present an approach that combines bag-of-words and spatial models to perform semantic and syntactic analysis for recognition of an object based on its internal appearance and its context.
To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth.