Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making it difficult to scale to large speech datasets and nuanced emotion taxonomies.
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.
Ranked #1 on Action Segmentation on COIN
We present a framework that formulates visual question answering as modular code generation.
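The gist of this formulation can be sketched in a few lines: a code LLM writes a short Python program over vision primitives, and the answer comes from executing it. The primitive names (`find_objects`, `simple_query`) and the hard-coded "generated" program below are illustrative stand-ins, not the paper's actual API.

```python
# Sketch: VQA as modular code generation. A code LLM would normally emit the
# program from the question; here it is hard-coded for illustration, and the
# vision primitives are stubs (hypothetical names, not the paper's API).

def find_objects(image, name):
    """Stub detector: return detected objects whose label matches `name`."""
    return [obj for obj in image["objects"] if obj["name"] == name]

def simple_query(image, question):
    """Stub single-hop VQA primitive."""
    return image["attributes"].get(question, "unknown")

generated_program = """
def answer(image):
    muffins = find_objects(image, "muffin")
    return str(len(muffins))
"""

image = {"objects": [{"name": "muffin"}, {"name": "muffin"}, {"name": "plate"}],
         "attributes": {}}

namespace = {"find_objects": find_objects, "simple_query": simple_query}
exec(generated_program, namespace)     # defines answer() in `namespace`
print(namespace["answer"](image))      # -> "2"
```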
2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Ranked #1 on Fine-Grained Image Recognition on OVEN
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
Ranked #8 on Video Question Answering on NExT-QA (using extra training data)
All such recipes rely on augmenting visual embeddings with temporal information (i.e., image -> video), often keeping text embeddings unchanged or even discarding them.
Ranked #8 on Action Classification on Charades
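A minimal sketch of this recipe, assuming frozen per-frame embeddings from a CLIP-style image encoder: a small temporal transformer lifts them to a single video embedding while the text tower is left untouched. The module and its hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Lift per-frame image embeddings to a single video embedding.

    Sketch only: a lightweight temporal transformer over frozen per-frame
    features; the text embeddings are not modified.
    """
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_embeddings):          # (B, T, D)
        contextualised = self.temporal(frame_embeddings)
        return contextualised.mean(dim=1)         # (B, D) video embedding

frames = torch.randn(4, 16, 512)                  # 4 clips, 16 frames each
video_emb = TemporalAdapter()(frames)
print(video_emb.shape)                            # torch.Size([4, 512])
```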
(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).
The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.
Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.
In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.
Ranked #1 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
Our goal in this paper is the adaptation of image-text models for long video retrieval.
Ranked #4 on Zero-Shot Action Recognition on Charades
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
Ranked #19 on Zero-Shot Video Retrieval on MSR-VTT (using extra training data)
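A rough sketch of such a mining step, assuming precomputed visual embeddings for the captioned images and the video frames: each frame inherits the caption of its nearest captioned image if the similarity clears a threshold. The encoders, features, and threshold are placeholders.

```python
import torch
import torch.nn.functional as F

def transfer_captions(image_feats, captions, frame_feats, sim_threshold=0.7):
    """Assign existing image captions to video frames by visual similarity.

    image_feats: (N, D) embeddings of captioned images
    captions:    list of N caption strings
    frame_feats: (M, D) embeddings of video frames
    Returns (frame_index, caption) pairs whose similarity clears the threshold.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    frame_feats = F.normalize(frame_feats, dim=-1)
    sims = frame_feats @ image_feats.T                 # (M, N) cosine similarity
    best_sim, best_idx = sims.max(dim=1)
    return [(f, captions[int(i)])
            for f, (s, i) in enumerate(zip(best_sim, best_idx))
            if s.item() >= sim_threshold]

pairs = transfer_captions(torch.randn(100, 512),
                          [f"caption {i}" for i in range(100)],
                          torch.randn(30, 512))
print(len(pairs), "frames received a transferred caption")
```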
Recent video and language pretraining frameworks lack the ability to generate sentences.
Ranked #10 on Video Captioning on MSR-VTT (using extra training data)
Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
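In practice this kind of multi-modal supervision is usually expressed as pairwise contrastive (NCE-style) losses between per-clip embeddings; the sketch below assumes hypothetical video, audio, and text encoders and an illustrative temperature.

```python
import torch
import torch.nn.functional as F

def nce_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Per-clip embeddings from (hypothetical) video, audio and text encoders.
video, audio, text = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)

# Train the video encoder against both other modalities.
loss = nce_loss(video, audio) + nce_loss(video, text)
print(loss.item())
```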
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.
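One way to realise this, sketched below under assumed shapes: the target action's feature queries the features of its surrounding actions with multi-head attention and is refined with a residual connection. The actual architecture may differ.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Refine a target action feature by attending to surrounding actions.

    Sketch only: the target segment is the query, its temporal neighbours are
    the keys/values.
    """
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, target, neighbours):        # (B, 1, D), (B, K, D)
        context, _ = self.attn(target, neighbours, neighbours)
        return (target + context).squeeze(1)      # residual, (B, D)

target = torch.randn(2, 1, 512)                   # current action
neighbours = torch.randn(2, 6, 512)               # 6 surrounding actions
print(ContextAttention()(target, neighbours).shape)   # torch.Size([2, 512])
```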
no code implementations • Cesar Ilharco, Afsaneh Shirazi, Arjun Gopalan, Arsha Nagrani, Blaz Bratanic, Chris Bregler, Christina Funk, Felipe Ferreira, Gabriel Barcik, Gabriel Ilharco, Georg Osang, Jannis Bulian, Jared Frank, Lucas Smaira, Qin Cao, Ricardo Marino, Roma Patel, Thomas Leung, Vaiva Imbrasaite
How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Action Classification on Kinetics-Sounds
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
Ranked #13 on Zero-Shot Video Retrieval on DiDeMo (using extra training data)
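The efficiency of such a joint embedding comes from precomputing the video side: the gallery is embedded once offline, and a text query reduces retrieval to a single matrix product. The encoders in this sketch are stand-ins.

```python
import torch
import torch.nn.functional as F

# Offline: embed the whole gallery of videos once with a (hypothetical) video
# encoder and keep the normalised embeddings in memory or an index.
video_gallery = F.normalize(torch.randn(10_000, 512), dim=-1)

def retrieve(text_embedding, k=5):
    """Rank all videos for one text query with a single matrix product."""
    query = F.normalize(text_embedding, dim=-1)
    scores = video_gallery @ query                # (10_000,)
    return scores.topk(k).indices

query = torch.randn(512)                          # output of a text encoder
print(retrieve(query))                            # indices of the top-5 videos
```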
We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs.
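A minimal sketch of the two-stream idea on spectrogram inputs; the role of each stream, the channel sizes, and the late fusion are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

def conv_stream(out_dim=128):
    """A small CNN stream over a (1, F, T) spectrogram."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim))

class TwoStreamAudioNet(nn.Module):
    """Two CNN streams over the same spectrogram, fused for classification."""
    def __init__(self, num_classes=50):
        super().__init__()
        self.stream_a = conv_stream()
        self.stream_b = conv_stream()
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, spec):                      # (B, 1, F, T)
        return self.classifier(torch.cat([self.stream_a(spec),
                                          self.stream_b(spec)], dim=-1))

logits = TwoStreamAudioNet()(torch.randn(4, 1, 128, 256))
print(logits.shape)                               # torch.Size([4, 50])
```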
It provides a voice to a minority (female) group in the computer vision community and focuses on increasing the visibility of these researchers, both in academia and industry.
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
Testing capacity for COVID-19 remains a challenge globally due to the lack of adequate supplies, trained personnel, and sample-processing equipment.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
Our objective in this work is long range understanding of the narrative structure of movies.
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
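A sketch of that fine-tuning step with Hugging Face Transformers; the checkpoint, the example segment, and the size of the action label set are placeholders.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_ACTIONS = 18                                   # placeholder label-set size
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=NUM_ACTIONS)

# One (speech segment, action label) pair; the text and label are illustrative.
batch = tokenizer(["Keep your hands on the wheel."], return_tensors="pt",
                  padding=True, truncation=True)
labels = torch.tensor([3])                         # e.g. index of "drive"

outputs = model(**batch, labels=labels)            # one fine-tuning step
outputs.loss.backward()
predicted = outputs.logits.argmax(dim=-1)
print(predicted)
```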
The objective of this paper is to learn representations of speaker identity without access to manually annotated data.
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.
In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019.
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55
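A minimal sketch of temporal binding, assuming per-timestep RGB, flow, and audio features (possibly sampled at offset times within the binding window): modalities are fused at each step before temporal aggregation. Shapes and the fusion MLP are illustrative.

```python
import torch
import torch.nn as nn

class TemporalBinding(nn.Module):
    """Fuse per-timestep RGB / flow / audio features inside a binding window,
    then aggregate the fused representation over time. Sketch only."""
    def __init__(self, dims=(512, 512, 256), hidden=512, num_classes=97):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(sum(dims), hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, flow, audio):
        # Each input is (B, T, D_m); timesteps may be sampled at offsets
        # within the binding window. Fuse first, then pool over time.
        fused = self.fuse(torch.cat([rgb, flow, audio], dim=-1))
        return self.classifier(fused.mean(dim=1))

rgb = torch.randn(2, 8, 512)
flow = torch.randn(2, 8, 512)
audio = torch.randn(2, 8, 256)
print(TemporalBinding()(rgb, flow, audio).shape)   # torch.Size([2, 97])
```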
The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.
Ranked #20 on Video Retrieval on MSVD
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.
We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.
Ranked #3 on Facial Expression Recognition (FER) on FERPlus
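The distillation step can be sketched as follows, with linear layers standing in for the face teacher and the audio student: the teacher's soft emotion predictions on face frames supervise the student on the time-aligned speech, so no labelled audio is needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8
teacher = nn.Linear(512, NUM_EMOTIONS)            # stands in for a face CNN
student = nn.Linear(128, NUM_EMOTIONS)            # stands in for an audio CNN

face_feats = torch.randn(16, 512)                 # frames of a speaking face
audio_feats = torch.randn(16, 128)                # time-aligned speech features

with torch.no_grad():                             # teacher provides soft labels
    soft_targets = F.softmax(teacher(face_feats), dim=-1)

student_logp = F.log_softmax(student(audio_feats), dim=-1)
loss = F.kl_div(student_logp, soft_targets, reduction="batchmean")
loss.backward()
print(loss.item())
```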
The objective of this paper is speaker recognition under noisy and unconstrained conditions.
Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)
We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.
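A sketch of the binary matching variant, with linear layers standing in for the face and voice CNN subnetworks: the two embeddings are concatenated and a small head predicts whether they come from the same identity.

```python
import torch
import torch.nn as nn

class FaceVoiceMatcher(nn.Module):
    """Binary cross-modal matching: do this face and this voice belong to the
    same person? The two encoders below stand in for the CNN subnetworks."""
    def __init__(self, face_dim=512, voice_dim=256, hidden=256):
        super().__init__()
        self.face_enc = nn.Linear(face_dim, hidden)    # stand-in face CNN
        self.voice_enc = nn.Linear(voice_dim, hidden)  # stand-in voice CNN
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, face, voice):
        fused = torch.cat([self.face_enc(face), self.voice_enc(voice)], dim=-1)
        return self.head(fused).squeeze(-1)            # logit: same identity?

matcher = FaceVoiceMatcher()
logits = matcher(torch.randn(4, 512), torch.randn(4, 256))
labels = torch.tensor([1., 0., 1., 0.])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```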
The goal of this paper is the automatic identification of characters in TV and feature film material.
Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance.