High-level understanding of stories in video such as movies and TV shows from raw data is extremely challenging.
Ranked #1 on Video Question Answering on KnowIT VQA
Existing generative adversarial networks (GANs) for speech enhancement rely solely on the convolution operation, which may obscure temporal dependencies across the sequence input.
Sound event localization and detection (SELD) has been commonly tackled using multitask models.
Humans share a strong tendency to memorize/forget some of the visual information they encounter.
Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events.
Audio fingerprinting, also known as audio hashing, is well known as a powerful technique for audio identification and synchronization.