The resulting algorithm is referred to as AutoFocus and results in a 2.5-5 times speed-up during inference when used with SNIP.
The widely adopted sequential variant of Non-Maximum Suppression (Greedy-NMS) is a crucial module for object-detection pipelines.
To this end, RSO adds a perturbation to a weight in a deep neural network and tests if it reduces the loss on a mini-batch.
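A minimal sketch of this perturb-and-test step in PyTorch (the sampling distribution, step size, and update schedule here are illustrative assumptions, not the paper's exact procedure):

```python
import torch

def rso_step(model, loss_fn, inputs, targets, std=0.01):
    """Perturb one randomly chosen weight and keep the change only if it
    lowers the mini-batch loss -- an illustrative sketch of the idea."""
    with torch.no_grad():
        base_loss = loss_fn(model(inputs), targets).item()
        # Pick a random parameter tensor, then a random element inside it.
        params = [p for p in model.parameters() if p.requires_grad]
        p = params[torch.randint(len(params), (1,)).item()]
        idx = tuple(torch.randint(s, (1,)).item() for s in p.shape)
        delta = torch.randn(1).item() * std
        p[idx] += delta
        if loss_fn(model(inputs), targets).item() > base_loss:
            p[idx] -= delta  # revert: the perturbation made things worse
```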
Deep neural networks have been shown to suffer from poor generalization when small perturbations (such as Gaussian noise) are added, yet little work has been done to evaluate their robustness to more natural image transformations like photo filters.
Most work on automated deception detection (ADD) in video has two restrictions: (i) it focuses on a video of one person, and (ii) it focuses on a single act of deception in a one- or two-minute video.
We analyze how well their features generalize to tasks like image classification, semantic segmentation, and object detection on small datasets such as PASCAL-VOC, Caltech-256, SUN-397, and Flowers-102.
We present Temporal Aggregation Network (TAN) which decomposes 3D convolutions into spatial and temporal aggregation blocks.
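The decomposition can be illustrated with a (2+1)D-style factorization in PyTorch, where a full 3D convolution is replaced by a 2D spatial convolution followed by a 1D temporal convolution; TAN's actual aggregation blocks differ in detail, so this is only a sketch of the idea:

```python
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Factorize a k x k x k 3D conv into a spatial (1, k, k) conv
    followed by a temporal (k, 1, 1) conv -- an illustrative sketch."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        pad = k // 2
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, pad, pad))
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(pad, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.relu(self.temporal(self.relu(self.spatial(x))))
```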
Instead of processing an entire image pyramid, AutoFocus adopts a coarse-to-fine approach and only processes regions that are likely to contain small objects at finer scales.
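In pseudocode, such a coarse-to-fine loop might look as follows; `detect` and `predict_focus_regions` are hypothetical stand-ins for the detector and the focus-region predictor, not the released implementation:

```python
def autofocus_inference(image, scales, detect, predict_focus_regions):
    """Coarse-to-fine sketch: detect on the whole image at the coarsest
    scale, then re-process only predicted focus regions at finer scales."""
    detections, regions = [], [image]  # start with the full image
    for scale in scales:               # ordered coarse -> fine
        next_regions = []
        for crop in regions:
            detections.extend(detect(crop, scale))
            # Regions likely to contain small objects are re-examined
            # at the next (finer) scale.
            next_regions.extend(predict_focus_regions(crop, scale))
        regions = next_regions
    return detections
```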
Interestingly, we observe that after dropping 30% of the annotations (and labeling them as background), the performance of CNN-based object detectors like Faster-RCNN drops by only 5% on the PASCAL VOC dataset.
Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second during inference with a single GPU.
Our approach is a modification of the R-FCN architecture in which position-sensitive filters are shared across different object classes for performing localization.
On the COCO dataset, our single-model performance is 45.7% mAP and an ensemble of 3 networks obtains an mAP of 48.3%.
For each temporal segment inside a proposal, features are uniformly sampled at a pair of scales and are input to a temporal convolutional neural network for classification.
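A rough illustration of sampling a fixed number of features from a segment at two scales (the number of samples and the 2x context window are assumptions made for this sketch):

```python
import numpy as np

def sample_segment_features(features, start, end, k=8):
    """Uniformly sample k per-frame feature vectors from the segment and
    from a 2x-enlarged window around it -- an illustrative sketch."""
    def uniform(s, e):
        idx = np.linspace(s, e - 1, k).round().astype(int)
        return features[np.clip(idx, 0, len(features) - 1)]
    center, half = (start + end) / 2.0, (end - start) / 2.0
    return np.concatenate([uniform(start, end),
                           uniform(center - 2 * half, center + 2 * half)])
```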
To this end, we propose Soft-NMS, an algorithm which decays the detection scores of all other objects as a continuous function of their overlap with the maximum-scoring detection M. Hence, no object is eliminated in this process.
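The score-decay step can be sketched in NumPy with the Gaussian penalty variant; box coordinates are assumed to be (x1, y1, x2, y2):

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS sketch: instead of discarding boxes that overlap the
    current maximum M, decay their scores continuously with IoU(M, box)."""
    scores = scores.copy()
    keep = []
    while scores.size and scores.max() > score_thresh:
        m = scores.argmax()
        keep.append(m)
        ious = iou(boxes[m], boxes)
        scores *= np.exp(-(ious ** 2) / sigma)  # continuous Gaussian decay
        scores[m] = 0.0  # M is kept; exclude it from future picks
    return keep  # indices of retained detections

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter)
```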
This work considers targeted style transfer, in which the style of a template image is used to alter only part of a target image.
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection.
With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks.
VRFP is a real-time video retrieval framework that takes short text queries as input and obtains weakly labeled training images from the web after the query is known.
In this paper, we attempt to overcome the two problems above by proposing an optimization method for training deep neural networks that uses learning rates which are both specific to each layer of the network and adaptive to the curvature of the loss function, increasing the learning rate at low-curvature points.
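In PyTorch terms, layer-specific learning rates map naturally onto optimizer parameter groups; the curvature scaling below is a schematic stand-in for the method, and `curvature_estimate` is a hypothetical callable returning a positive scalar per parameter tensor:

```python
import torch

def build_optimizer(model, base_lr, curvature_estimate):
    """Give each parameter tensor (roughly, each layer) its own learning
    rate, scaled inversely with an estimate of local curvature so that
    low-curvature directions get larger steps -- a schematic sketch."""
    groups = []
    for name, param in model.named_parameters():
        c = curvature_estimate(name, param)  # larger c = sharper curvature
        groups.append({"params": [param], "lr": base_lr / (c + 1e-8)})
    return torch.optim.SGD(groups)
```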
Given a text description of an event, event retrieval is performed by selecting concepts linguistically related to the event description and fusing the concept responses on unseen videos.