1 code implementation • 16 Jun 2020 • Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval.
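The core idea of such a shared embedding space is that encoders (not shown here) map each modality to vectors in one common space, so retrieval reduces to a nearest-neighbor search between a text query's embedding and the stored video embeddings. A minimal sketch, assuming hypothetical pre-computed embeddings and cosine similarity as the scoring function:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: each clip has already been projected by some
# learned encoder (not shown) into the same 4-dimensional shared space.
video_embeddings = {
    "cooking_clip": [0.9, 0.1, 0.0, 0.2],
    "soccer_clip":  [0.1, 0.8, 0.3, 0.0],
}

def retrieve(text_embedding, video_embeddings):
    """Rank video clips by similarity to the text query in the shared space."""
    return sorted(
        video_embeddings,
        key=lambda name: cosine(text_embedding, video_embeddings[name]),
        reverse=True,
    )

# A text query whose (hypothetical) embedding lies near the cooking clip.
query = [0.85, 0.15, 0.05, 0.25]
print(retrieve(query, video_embeddings))
```

This only illustrates the retrieval step; in the actual model the embeddings would be produced by jointly trained audio, video, and text encoders.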
Affective computing (AC) on such data can help us understand human behavior and enable a wide range of applications.
Fine-grained action detection is an important task with numerous applications in robotics and human-computer interaction.
The production of sports highlight packages summarizing a game's most exciting moments is an essential task for broadcast media.