Search Results for author: Arsha Nagrani

Found 34 papers, 13 papers with code

A CLIP-Hitchhiker's Guide to Long Video Retrieval

no code implementations 17 May 2022 Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our goal in this paper is the adaptation of image-text models for long video retrieval.

Frame Video Retrieval

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning Video Captioning +1

Audio-Visual Synchronisation in the wild

no code implementations 8 Dec 2021 Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.

Lip Reading

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Video Retrieval

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

1 code implementation 1 Nov 2021 Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.

Action Recognition Language Modelling

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Ranked #1 on Audio Classification on VGGSound (Top 5 Accuracy metric)

Action Classification Action Recognition +1

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

4 code implementations ICCV 2021 Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.

Ranked #6 on Video Retrieval on DiDeMo (using extra training data)

Text to Video Retrieval Video Captioning +1

Slow-Fast Auditory Streams For Audio Recognition

2 code implementations 5 Mar 2021 Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs.

Audio Classification

WiCV 2020: The Seventh Women In Computer Vision Workshop

no code implementations 11 Jan 2021 Hazel Doughty, Nour Karessli, Kathryn Leonard, Boyi Li, Carianne Martinez, Azadeh Mobasher, Arsha Nagrani, Srishti Yadav

It provides a voice to a minority (female) group in the computer vision community and focuses on increasing the visibility of these researchers, both in academia and industry.

Look Before you Speak: Visually Contextualized Utterances

no code implementations CVPR 2021 Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds

1 code implementation 17 Sep 2020 Piyush Bagad, Aman Dalmia, Jigar Doshi, Arsha Nagrani, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agarwal, Rahul Panicker

Testing capacity for COVID-19 remains a challenge globally due to the lack of adequate supplies, trained personnel, and sample-processing equipment.

Spot the conversation: speaker diarisation in the wild

no code implementations 2 Jul 2020 Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Speaker Verification

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments.

Action Recognition

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations 20 Feb 2020 Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning Speaker Recognition

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations 5 Dec 2019 Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.

Speaker Recognition

WiCV 2019: The Sixth Women In Computer Vision Workshop

no code implementations 23 Sep 2019 Irene Amerini, Elena Balashova, Sayna Ebrahimi, Kathryn Leonard, Arsha Nagrani, Amaia Salvador

In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019.

Count, Crop and Recognise: Fine-Grained Recognition in the Wild

no code implementations 19 Sep 2019 Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman

The goal of this paper is to label all the animal individuals present in every frame of a video.

Frame

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

1 code implementation ICCV 2019 Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.

Action Recognition Egocentric Activity Recognition

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

3 code implementations 31 Jul 2019 Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.

Video Retrieval

Utterance-level Aggregation For Speaker Recognition In The Wild

8 code implementations 26 Feb 2019 Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.

Frame Speaker Recognition +1

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

no code implementations 16 Aug 2018 Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.

Ranked #3 on Facial Expression Recognition on FERPlus (using extra training data)

Facial Emotion Recognition Facial Expression Recognition +1

VoxCeleb2: Deep Speaker Recognition

2 code implementations 14 Jun 2018 Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition Speaker Verification

Seeing Voices and Hearing Faces: Cross-modal biometric matching

no code implementations CVPR 2018 Arsha Nagrani, Samuel Albanie, Andrew Zisserman

We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.

Face Recognition Speaker Identification

VoxCeleb: a large-scale speaker identification dataset

8 code implementations Interspeech 2018 Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state-of-the-art speaker identification techniques on our dataset to establish baseline performance.

Sound
