In this paper, we propose a subspace representation learning (SRL) framework to tackle few-shot image classification tasks.
The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer.
Ranked #11 on Synthetic-to-Real Translation on SYNTHIA-to-Cityscapes
To deal with the aforementioned problems, in this paper, we propose a training-free monocular 3D event detection system for traffic surveillance.
Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain.
Ranked #1 on Domain Adaptation on VisDA2017
In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images.
In this work, we propose the idea of visual distributional representation, which interprets an image set as samples drawn from an unknown distribution in appearance feature space.
In this work, we first describe a CNN-based approach for weakly supervised training of audio events.
DecideNet starts with estimating the crowd density by generating detection- and regression-based density maps separately.
Ranked #9 on Crowd Counting on WorldExpo’10
relevant) to the given event class, we formulate this task as a multi-instance learning (MIL) problem by taking each video as a bag and the video shots in each video as instances.
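The bag-and-instance formulation in the excerpt above can be illustrated with a minimal sketch. It assumes the common max-pooling MIL convention, in which a video (bag) is scored by its most relevant shot (instance); the excerpt does not say which aggregation the authors use, so this is an assumption. The names `bag_score` and `rank_videos` and the per-shot scores are hypothetical.

```python
from typing import Dict, List


def bag_score(instance_scores: List[float]) -> float:
    """Score a bag (video) from its instance (shot) scores.

    Assumption: max-pooling MIL, i.e. a video is as relevant as its
    most relevant shot. Other aggregations (mean, top-k) are possible.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


def rank_videos(videos: Dict[str, List[float]]) -> List[str]:
    """Return video ids sorted by descending bag score."""
    return sorted(videos, key=lambda vid: bag_score(videos[vid]), reverse=True)


# Hypothetical per-shot relevance scores for three videos.
videos = {
    "video_a": [0.1, 0.2, 0.9],
    "video_b": [0.4, 0.3],
    "video_c": [0.05],
}
ranking = rank_videos(videos)
```

Under this convention a single highly relevant shot is enough to rank a video highly, which fits the setting where only some shots of a positive video relate to the event.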
The proposed method can take full advantage of the structured distance relationships among these training samples by means of the constructed complete graph.
We report on CMU Informedia Lab's system used in Google's YouTube 8 Million Video Understanding Challenge.
State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs.
Ranked #15 on Action Recognition on UCF101
The heterogeneity gap between different modalities poses a significant challenge to multimedia information retrieval.
Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance.
Ranked #68 on Person Re-Identification on DukeMTMC-reID
Multimedia event detection has been receiving increasing attention in recent years.
no code implementations • 17 Jun 2016 • Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan, Xingzhong Du, Xiaojun Chang
The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search.
Therefore, our tracker propagates identity information to frames without recognized faces by uncovering the appearance and spatial manifold formed by person detections.
In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars.
We propose two well-motivated ranking-based methods to enhance the performance of current state-of-the-art human activity recognition systems.
We approach this problem by first showing that local handcrafted features and Convolutional Neural Networks (CNNs) share the same convolution-pooling network structure.
In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future.
Therefore, they need to occur frequently enough in the videos and to be able to tell the difference among different types of motions.
MIFS compensates for information lost from using differential operators by recapturing information at coarse scales.
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available.
We address the challenging problem of utilizing related exemplars for complex event detection while multiple features are available.
Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events.