We experimented with different representation-learning models and used the learned model to generate synthetic process data.
We investigate the mapping-based method in the time domain and show that it can outperform the masking-based method when trained on a large training set.
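The distinction can be sketched minimally as follows; the encoder/decoder shapes and module names below are illustrative assumptions, not the paper's architecture. A mapping-based model regresses the clean waveform directly from learned time-domain features, while a masking-based model predicts a bounded mask that is applied multiplicatively to those features before decoding.

```python
import torch
import torch.nn as nn

class TimeDomainEnhancer(nn.Module):
    """Toy 1-D conv encoder/decoder; `mode` switches between the two output heads."""
    def __init__(self, mode="mapping", channels=64):
        super().__init__()
        assert mode in ("mapping", "masking")
        self.mode = mode
        self.encoder = nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8, padding=4)
        self.head = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, noisy):                       # noisy: (batch, 1, samples)
        feats = torch.relu(self.encoder(noisy))     # latent time-domain features
        if self.mode == "mapping":
            # Mapping: regress the clean waveform directly from the features.
            return self.decoder(self.head(feats))
        # Masking: predict a bounded mask and apply it to the noisy features.
        mask = torch.sigmoid(self.head(feats))
        return self.decoder(feats * mask)

noisy = torch.randn(2, 1, 16000)                    # 1 s of 16 kHz audio
clean_est = TimeDomainEnhancer(mode="mapping")(noisy)
print(clean_est.shape)                              # (2, 1, 16000)
```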
We first introduce the vanilla video transformer and show that the transformer module can perform spatio-temporal modeling from raw pixels, but at the cost of heavy memory usage.
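A minimal sketch of why joint space-time attention over raw-pixel patches is memory-hungry; the patch size, embedding width, and layer count here are illustrative assumptions, not the paper's configuration. Attending over all T×H×W patch tokens at once builds an attention map that grows quadratically with the number of space-time tokens.

```python
import torch
import torch.nn as nn

# Illustrative settings: 8 frames of 224x224 video, 16x16 patches, 768-dim tokens.
frames, height, width, patch, dim = 8, 224, 224, 16, 768
num_tokens = frames * (height // patch) * (width // patch)    # 8 * 14 * 14 = 1568

patch_embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                        stride=(1, patch, patch))             # raw pixels -> tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)

video = torch.randn(1, 3, frames, height, width)
tokens = patch_embed(video).flatten(2).transpose(1, 2)        # (1, 1568, 768)
out = encoder(tokens)

# Joint space-time attention forms a (num_tokens x num_tokens) map per head,
# so memory scales with (T*H*W / patch^2)^2 -- here 1568^2 ~ 2.5M entries per head.
print(out.shape, num_tokens ** 2)
```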
2 Apr 2021 • Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe
We propose TubeR: a simple solution for spatio-temporal video action detection.
Multi-label activity recognition aims to recognize multiple activities performed simultaneously or sequentially in each video.
Most current speech enhancement models operate on spectrogram features, which require an expensive transformation and discard phase information.
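A minimal numpy sketch of that point; the window length and hop size are illustrative assumptions. Magnitude-spectrogram features keep only |STFT(x)|, so the phase needed to reconstruct the waveform exactly is thrown away.

```python
import numpy as np

def stft(x, win=512, hop=128):
    """Naive STFT: Hann-windowed frames -> complex spectrum, shape (frames, bins)."""
    window = np.hanning(win)
    frames = [x[i:i + win] * window for i in range(0, len(x) - win, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

signal = np.random.randn(16000)          # stand-in for 1 s of 16 kHz speech
spec = stft(signal)
magnitude = np.abs(spec)                 # typical spectrogram feature
phase = np.angle(spec)                   # discarded by magnitude-only models

# Inverting the magnitude alone (i.e. assuming zero phase) no longer recovers
# the windowed frames that produced it; the phase information is simply gone.
frames_true = np.fft.irfft(spec, axis=1)
frames_zero_phase = np.fft.irfft(magnitude.astype(complex), axis=1)
print(np.allclose(frames_true, frames_zero_phase))   # False
```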
The proposed hybrid attention architecture helps the system focus on learning informative representations for both modality-specific feature extraction and model fusion.
Multimodal affective computing, which learns to recognize and interpret human affect and subjective information from multiple data sources, remains challenging because: (i) it is hard to extract informative features that represent human affect from heterogeneous inputs; (ii) current fusion strategies only fuse modalities at an abstract level, ignoring time-dependent interactions between them.
In this paper, we present a novel deep multimodal framework to predict human emotions based on sentence-level spoken language.
We applied PIMA to medical workflow data, showing how iterative alignment better represents the data and facilitates extracting insights through data visualization.
For the Olympic swimming dataset, our system achieved an accuracy of 88%, an F1-score of 0.58, a completeness estimation error of 6.3%, and a remaining-time estimation error of 2.9 minutes.
Our system is the first to address concurrent activity recognition with multisensory data using a single model that is scalable, simple to train, and easy to deploy.