We present NMT-Keras, a flexible toolkit for training deep learning models, which puts a particular emphasis on the development of advanced applications of neural machine translation systems, such as interactive-predictive translation protocols and long-term adaptation of the translation system via continuous learning.
In this paper, we introduce a network architecture that takes long-term content into account while enabling fast per-video processing.
We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts", using Gated Recurrent Unit (GRU) recurrent networks. Our method relies on percepts extracted from all levels of a deep convolutional network trained on the large ImageNet dataset.
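As a rough illustration of this idea (a minimal sketch, not the paper's actual architecture, which applies convolutional GRUs to feature maps from several layers), one could pool per-frame ImageNet features from a pretrained CNN and aggregate them over time with a single GRU; the frame count, class count, and use of ResNet50 below are assumptions for the example:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FRAMES, NUM_CLASSES = 16, 101  # hypothetical values for illustration

# Frozen ImageNet-pretrained CNN producing one "percept" vector per frame
# (inputs are assumed to be already resized and preprocessed).
cnn = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                     weights="imagenet")
cnn.trainable = False

frames = layers.Input(shape=(NUM_FRAMES, 224, 224, 3))  # a short clip of RGB frames
percepts = layers.TimeDistributed(cnn)(frames)          # (batch, NUM_FRAMES, 2048)
hidden = layers.GRU(512)(percepts)                      # temporal aggregation with a GRU
logits = layers.Dense(NUM_CLASSES)(hidden)              # clip-level prediction

model = Model(frames, logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))
```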
Current deep caption models can only describe objects contained in paired image-sentence corpora, despite being pre-trained on large object recognition datasets, namely ImageNet.
A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics; the reference semantics can then be transferred to the test video.
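As a loose illustration of this kind of correspondence-based transfer (a nearest-neighbour sketch under assumed inputs, not the paper's actual matching procedure), each test clip could be matched to its most similar reference clip and inherit that clip's label:

```python
import numpy as np

# Hypothetical inputs: precomputed clip feature vectors (e.g. from a CNN).
test_clips = np.random.randn(8, 512)         # clips of the test video
ref_clips = np.random.randn(500, 512)        # clips from reference videos
ref_labels = np.random.randint(0, 20, 500)   # known semantics of each reference clip

def normalize(x):
    # L2-normalise rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = normalize(test_clips) @ normalize(ref_clips).T  # (8, 500) similarity matrix

# Transfer the semantics of the best-matching reference clip to each test clip.
nearest = sim.argmax(axis=1)
transferred = ref_labels[nearest]
```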
To address this problem, we propose an end-to-end transformer model for dense video captioning.
Although traditionally used in the field of machine translation, the encoder-decoder framework has recently been applied to the generation of video and image descriptions.
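To make the pattern concrete (a minimal sketch only, assuming mean-pooled per-frame CNN features as the video representation and teacher-forced training; it does not correspond to any of the specific systems listed here), an encoder-decoder captioner can be written as a summary vector that initialises a word-level GRU decoder:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FEAT_DIM, VOCAB, MAX_LEN = 2048, 10000, 20  # hypothetical sizes

# Encoder: mean-pool per-frame CNN features and project to the decoder size.
video_feats = layers.Input(shape=(None, FEAT_DIM))   # variable-length frame features
pooled = layers.GlobalAveragePooling1D()(video_feats)
video_state = layers.Dense(512, activation="tanh")(pooled)

# Decoder: predicts the next word from previous words and the video summary.
prev_words = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB, 512, mask_zero=True)(prev_words)
dec = layers.GRU(512, return_sequences=True)(emb, initial_state=video_state)
word_logits = layers.Dense(VOCAB)(dec)

model = Model([video_feats, prev_words], word_logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```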
We present our submission to the Microsoft Video to Language Challenge, which consists of generating short captions describing videos in the challenge dataset.
In this paper, we describe a system for generating textual descriptions of short video clips using recurrent neural networks (RNNs), which we used in our participation in the Large Scale Movie Description Challenge 2015 at ICCV 2015.