This paper presents a study of semi-supervised learning with large convolutional networks.
HVU is organized hierarchically in a semantic taxonomy and frames multi-label, multi-task video understanding as a comprehensive problem encompassing the recognition of multiple semantic aspects of a dynamic scene.
So far, life-long learning (LLL) has been studied in relatively small-scale and artificial setups.
The videos retrieved by the search engines are then verified for correctness by human annotators.
The ability to capture temporal information has been critical to the development of video understanding models.
ImageNet classification is the de facto pretraining task for these models.
Large-scale visual understanding is challenging, as it requires a model to handle the widely spread and imbalanced distribution of <subject, relation, object> triples.
This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.
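One of the spatiotemporal convolution forms studied in this line of work is the (2+1)D factorization, which replaces a full 3D kernel with a spatial convolution followed by a temporal one. The sketch below (function names and channel sizes are illustrative, not from the paper) shows how the intermediate channel count can be chosen so the factorized form matches the parameter budget of the full 3D convolution:

```python
def conv3d_params(c_in, c_out, t, d):
    """Parameter count of a full 3D convolution with a t x d x d kernel."""
    return c_out * c_in * t * d * d

def conv2plus1d_params(c_in, c_out, t, d, m):
    """Parameter count of a (2+1)D factorization: a 1 x d x d spatial
    convolution into m intermediate channels, then a t x 1 x 1 temporal one."""
    return m * c_in * d * d + c_out * m * t

# Example: 64 -> 64 channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3, 3)
# Choose m so the factorized form matches the full-3D parameter budget.
m = (3 * 3 * 3 * 64 * 64) // (3 * 3 * 64 + 3 * 64)
factored = conv2plus1d_params(64, 64, 3, 3, m)  # equals `full` for this m
```

Matching the budget this way isolates the effect of the factorization itself (extra nonlinearity, easier optimization) from any change in model capacity.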
Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning.
This article discusses a framework to support the design and end-to-end planning of fixed millimeter-wave networks.
We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance.
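The core of the dense-sparse-dense flow is magnitude-based pruning: after dense training, the smallest-magnitude weights are masked to zero and the sparse network is retrained; the pruned connections are then restored (at zero) for a final dense phase. A minimal sketch of the masking step, with illustrative helper names not taken from the paper:

```python
def prune_mask(weights, sparsity):
    """Magnitude-based mask: keep the largest (1 - sparsity) fraction of weights."""
    k = int(len(weights) * (1.0 - sparsity))
    keep = sorted(range(len(weights)),
                  key=lambda i: abs(weights[i]), reverse=True)[:k]
    mask = [0.0] * len(weights)
    for i in keep:
        mask[i] = 1.0
    return mask

def apply_mask(weights, mask):
    """Zero out pruned weights; applied after every sparse-phase update."""
    return [w * m for w, m in zip(weights, mask)]

w = [0.9, -0.05, 0.4, 0.01, -0.7]
mask = prune_mask(w, sparsity=0.4)   # drops the two smallest-magnitude weights
sparse_w = apply_mask(w, mask)       # [0.9, 0.0, 0.4, 0.0, -0.7]
# Final dense phase: the zeroed weights re-enter training as free parameters.
```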
Language has been exploited to sidestep the problem of defining video categories, by formulating video understanding as the task of captioning or description.
Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis.
Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.
This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations.
With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them.
We explore the task of recognizing people's identities in photo albums in an unconstrained setting.
We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset.
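A 3D convolution slides a volumetric kernel over time as well as space, which is what lets these networks capture motion. A naive single-channel sketch (function name and shapes are illustrative only, unrelated to any particular architecture):

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution, valid padding, stride 1.
    volume: T x H x W nested lists; kernel: t x h x w nested lists."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):           # slide over time
        plane = []
        for j in range(H - h + 1):       # slide over height
            row = []
            for k in range(W - w + 1):   # slide over width
                s = sum(volume[i + a][j + b][k + c] * kernel[a][b][c]
                        for a in range(t) for b in range(h) for c in range(w))
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out

# A 2x2x2 averaging kernel (weights 1/8) over a 3x3x3 volume of ones
# yields a 2x2x2 output of 1.0 everywhere.
vol = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
ker = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
out = conv3d_valid(vol, ker)
```

Real 3D ConvNets stack many such layers over multi-channel inputs, with learned kernels; this sketch only shows the windowing over the extra temporal axis.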
The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results.
We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion.