no code implementations • 17 May 2022 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our goal in this paper is the adaptation of image-text models for long video retrieval.
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
no code implementations • 20 Jan 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
Recent video and language pretraining frameworks lack the ability to generate sentences.
no code implementations • 8 Dec 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
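A minimal sketch of what multi-modal supervision of this kind can look like: a video (appearance) encoder is trained so that its clip embedding agrees with audio and transcribed-speech embeddings of the same clip under a symmetric contrastive (InfoNCE) loss. The encoders, dimensions, and loss weighting below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Placeholder encoders: in practice these would be a video backbone,
# an audio backbone, and a text encoder over the transcribed speech.
B = 8
video_encoder = torch.nn.Linear(1024, 256)
audio_encoder = torch.nn.Linear(512, 256)
text_encoder = torch.nn.Linear(768, 256)

video_feat = video_encoder(torch.randn(B, 1024))
audio_feat = audio_encoder(torch.randn(B, 512))
text_feat = text_encoder(torch.randn(B, 768))

# The video encoder receives supervision from both of the other modalities.
loss = info_nce(video_feat, audio_feat) + info_nce(video_feat, text_feat)
loss.backward()
```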
1 code implementation • 1 Nov 2021 • Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.
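One plausible way to realise "attending to surrounding actions" is to treat the features of neighbouring action segments as keys and values in an attention layer whose query is the current action. The sketch below uses PyTorch's built-in multi-head attention with invented dimensions, purely as an illustration of that idea rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalContextAttention(nn.Module):
    """Refine the current action feature by attending over neighbouring actions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current, neighbours):
        # current: (B, 1, D) feature of the action to classify
        # neighbours: (B, N, D) features of surrounding actions in the video
        context, _ = self.attn(query=current, key=neighbours, value=neighbours)
        return self.norm(current + context)   # residual connection

B, N, D = 4, 6, 512
model = TemporalContextAttention(D)
refined = model(torch.randn(B, 1, D), torch.randn(B, N, D))
print(refined.shape)  # torch.Size([4, 1, 512])
```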
no code implementations • ACL 2021 • Cesar Ilharco, Afsaneh Shirazi, Arjun Gopalan, Arsha Nagrani, Blaz Bratanic, Chris Bregler, Christina Funk, Felipe Ferreira, Gabriel Barcik, Gabriel Ilharco, Georg Osang, Jannis Bulian, Jared Frank, Lucas Smaira, Qin Cao, Ricardo Marino, Roma Patel, Thomas Leung, Vaiva Imbrasaite
How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web.
1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Audio Classification on VGGSound (Top 5 Accuracy metric)
1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
We focus on contrastive methods for self-supervised video representation learning.
4 code implementations • ICCV 2021 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
Ranked #6 on Video Retrieval on DiDeMo (using extra training data)
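The practical appeal of such a joint embedding is that video representations can be pre-computed once and a text query ranked against the whole gallery with a single matrix product. The dual-encoder sketch below is a generic illustration under assumed encoders and dimensions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

# Placeholder towers; a real system would use a video transformer and a text
# transformer projecting into the same embedding space.
video_tower = torch.nn.Linear(1024, 256)
text_tower = torch.nn.Linear(768, 256)

# Offline: embed and L2-normalise the whole video gallery once.
gallery = F.normalize(video_tower(torch.randn(10_000, 1024)), dim=-1)

# Online: embed a text query and rank all videos with one matrix product.
query = F.normalize(text_tower(torch.randn(1, 768)), dim=-1)
scores = query @ gallery.t()              # (1, 10_000) cosine similarities
top5 = scores.topk(5, dim=-1).indices     # indices of the best-matching videos
print(top5)
```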
2 code implementations • 5 Mar 2021 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs.
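A minimal sketch of a two-stream spectrogram network: each stream is a small 2D CNN over a time-frequency input, and their features are fused before classification. The channel counts, kernel sizes and late-fusion choice are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

def spectrogram_stream(out_dim=128):
    """Small 2D CNN over a (1, freq, time) spectrogram input."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(64, out_dim))

class TwoStreamAudioNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stream_a = spectrogram_stream()   # one input representation
        self.stream_b = spectrogram_stream()   # a complementary one
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, spec_a, spec_b):
        fused = torch.cat([self.stream_a(spec_a), self.stream_b(spec_b)], dim=-1)
        return self.classifier(fused)

model = TwoStreamAudioNet()
logits = model(torch.randn(4, 1, 128, 200), torch.randn(4, 1, 128, 200))
```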
no code implementations • 11 Jan 2021 • Hazel Doughty, Nour Karessli, Kathryn Leonard, Boyi Li, Carianne Martinez, Azadeh Mobasher, Arsha Nagrani, Srishti Yadav
It provides a voice to a minority (female) group in the computer vision community and focuses on increasing the visibility of these researchers, both in academia and industry.
no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.
1 code implementation • 17 Sep 2020 • Piyush Bagad, Aman Dalmia, Jigar Doshi, Arsha Nagrani, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agarwal, Rahul Panicker
Testing capacity for COVID-19 remains a challenge globally due to the lack of adequate supplies, trained personnel, and sample-processing equipment.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
1 code implementation • 8 May 2020 • Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
Our objective in this work is long range understanding of the narrative structure of movies.
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
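At its core this is text classification: a pretrained BERT encoder is fine-tuned to map a transcribed speech segment to an action label. The snippet below uses the Hugging Face transformers API with a hypothetical label set, and is only a generic illustration of that setup.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical label set; the real model predicts a fixed vocabulary of actions.
actions = ["open", "run", "drive", "phone", "eat"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(actions))

# Classify a single transcribed speech segment (fine-tuning omitted).
batch = tokenizer(["let's get out of here, quickly!"],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
print(actions[logits.argmax(dim=-1).item()])
```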
no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
The objective of this paper is to learn representations of speaker identity without access to manually annotated data.
no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.
no code implementations • 23 Sep 2019 • Irene Amerini, Elena Balashova, Sayna Ebrahimi, Kathryn Leonard, Arsha Nagrani, Amaia Salvador
In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019.
no code implementations • 19 Sep 2019 • Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman
The goal of this paper is to label all the animal individuals present in every frame of a video.
1 code implementation • ICCV 2019 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55
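A rough illustration of mid-level fusion across modalities sampled within a temporal window: per-modality features from (possibly offset) time steps are concatenated and mixed before classification. The modalities, dimensions and fusion operator below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalBindingFusion(nn.Module):
    """Fuse RGB, flow and audio features sampled within a temporal window."""
    def __init__(self, dims=(1024, 1024, 512), hidden=512, num_classes=100):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(sum(dims), hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, flow, audio):
        # Each modality may come from a slightly different temporal offset
        # within the same binding window; fusion happens before classification.
        return self.classifier(self.fuse(torch.cat([rgb, flow, audio], dim=-1)))

model = TemporalBindingFusion()
logits = model(torch.randn(2, 1024), torch.randn(2, 1024), torch.randn(2, 512))
```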
3 code implementations • 31 Jul 2019 • Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman
The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.
Ranked #8 on Video Retrieval on DiDeMo
8 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.
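Handling variable-length utterances typically comes down to aggregating a variable number of frame-level features into a fixed-size speaker embedding. The sketch below uses simple attentive pooling as one such aggregation; this is a stand-in illustration under assumed dimensions, not the paper's specific aggregation method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Collapse (B, T, D) frame features into a fixed (B, D) speaker embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):
        weights = F.softmax(self.score(frames), dim=1)   # (B, T, 1)
        return (weights * frames).sum(dim=1)             # (B, D)

pool = AttentivePooling()
short = pool(torch.randn(1, 120, 256))   # a short utterance
long = pool(torch.randn(1, 900, 256))    # a much longer one
print(short.shape, long.shape)           # both torch.Size([1, 256])
```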
no code implementations • 16 Aug 2018 • Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.
Ranked #3 on Facial Expression Recognition on FERPlus (using extra training data)
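The cross-modal teacher-student setup described above amounts to distillation: a facial-emotion teacher provides soft targets for a speech student on temporally aligned face-voice pairs, so no labelled audio is needed. The sketch below assumes placeholder encoders, an invented number of emotion classes, and a KL-based distillation loss purely for illustration.

```python
import torch
import torch.nn.functional as F

num_emotions = 8                                # assumed label count
teacher = torch.nn.Linear(512, num_emotions)    # placeholder face-emotion teacher
student = torch.nn.Linear(256, num_emotions)    # placeholder speech student

face_feat, speech_feat = torch.randn(16, 512), torch.randn(16, 256)

with torch.no_grad():                           # the teacher stays frozen
    soft_targets = F.softmax(teacher(face_feat) / 2.0, dim=-1)

log_probs = F.log_softmax(student(speech_feat) / 2.0, dim=-1)
loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
loss.backward()
```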
2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman
The objective of this paper is speaker recognition under noisy and unconstrained conditions.
Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)
no code implementations • ECCV 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman
We propose and investigate an identity sensitive joint embedding of face and voice.
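A generic way to realise such a joint embedding is to project face and voice features into a shared space and train with a pairwise contrastive objective so that a person's face and voice land close together while mismatched identities are pushed apart. The margin-based loss, projectors and dimensions below are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

face_proj = torch.nn.Linear(512, 128)    # placeholder face feature projector
voice_proj = torch.nn.Linear(256, 128)   # placeholder voice feature projector

def pair_loss(face, voice, same_identity, margin=0.6):
    """Pull matched face/voice pairs together, push mismatched pairs apart."""
    f = F.normalize(face_proj(face), dim=-1)
    v = F.normalize(voice_proj(voice), dim=-1)
    dist = (f - v).pow(2).sum(dim=-1)                       # squared distance
    positive = same_identity * dist
    negative = (1 - same_identity) * F.relu(margin - (dist + 1e-8).sqrt()).pow(2)
    return (positive + negative).mean()

face = torch.randn(8, 512)
voice = torch.randn(8, 256)
labels = torch.randint(0, 2, (8,)).float()   # 1 = same identity, 0 = different
loss = pair_loss(face, voice, labels)
loss.backward()
```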
no code implementations • CVPR 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman
We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.
no code implementations • 31 Jan 2018 • Arsha Nagrani, Andrew Zisserman
The goal of this paper is the automatic identification of characters in TV and feature film material.
8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman
Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance.