Search Results for author: Arsha Nagrani

Found 54 papers, 24 papers with code

Attention Bottlenecks for Multimodal Fusion

1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Ranked #2 on Action Classification on Kinetics-Sounds

Action Classification Action Recognition +2

2,979

Paper
Code

UnLoc: A Unified Framework for Video Localization Tasks

1 code implementation • ICCV 2023 • Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task.

Ranked #1 on Action Segmentation on COIN

Action Segmentation Moment Retrieval +5

2,979

Paper
Code

Streaming Dense Video Captioning

1 code implementation • 1 Apr 2024 • Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.

Dense Video Captioning

2,979

Paper
Code

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

3 code implementations • CVPR 2023 • Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale.

Ranked #1 on Dense Video Captioning on ActivityNet Captions (using extra training data)

Dense Video Captioning Language Modelling +1

2,978

Paper
Code

Verbs in Action: Improving verb understanding in video-language models

1 code implementation • ICCV 2023 • Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.

Ranked #11 on Zero-Shot Video Question Answer on NExT-QA

Contrastive Learning Text Matching +2

2,976

Paper
Code

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

5 code implementations • ICCV 2021 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

Ranked #4 on Video Retrieval on QuerYD (using extra training data)

Retrieval Text Retrieval +4

2,959

Paper
Code

VoxCeleb: a large-scale speaker identification dataset

8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance.

Sound

487

Paper
Code

Utterance-level Aggregation For Speaker Recognition In The Wild

9 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild"-where utterances may be of variable length and also contain irrelevant signals.

Speaker Recognition Text-Independent Speaker Verification

446

Paper
Code

VoxCeleb2: Deep Speaker Recognition

2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition Speaker Verification

366

Paper
Code

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

3 code implementations • 31 Jul 2019 • Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.

Ranked #24 on Video Retrieval on MSVD

Natural Language Queries Retrieval +2

327

Paper
Code

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

This report summarizes the results of the first edition of the challenge together with the findings of the participants.

Natural Language Queries Retrieval +3

327

Paper
Code

Condensed Movies: Story Based Retrieval with Contextual Embeddings

1 code implementation • 8 May 2020 • Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman

Our objective in this work is long range understanding of the narrative structure of movies.

Retrieval Text to Video Retrieval +1

146

Paper
Code

AutoAD: Movie Description in Context

1 code implementation • CVPR 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.

Image Captioning Text Generation

134

Paper
Code

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

119

Paper
Code

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

1 code implementation • ICCV 2019 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i. e. the combination of modalities within a range of temporal offsets.

Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55

Action Recognition Egocentric Activity Recognition

103

Paper
Code

Modular Visual Question Answering via Code Generation

1 code implementation • 8 Jun 2023 • Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

We present a framework that formulates visual question answering as modular code generation.

Code Generation In-Context Learning +2

Paper
Code

Localizing Visual Sounds the Hard Way

1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.

Contrastive Learning

Paper
Code

PaLI-X: On Scaling up a Multilingual Vision and Language Model

2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.

Ranked #1 on Fine-Grained Image Recognition on OVEN

Chart Question Answering document understanding +9

Paper
Code

Slow-Fast Auditory Streams For Audio Recognition

2 code implementations • 5 Mar 2021 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs.

Ranked #1 on Human Interaction Recognition on EPIC-SOUNDS

Audio Classification Human Interaction Recognition

Paper
Code

Cough Against COVID: Evidence of COVID-19 Signature in Cough Sounds

1 code implementation • 17 Sep 2020 • Piyush Bagad, Aman Dalmia, Jigar Doshi, Arsha Nagrani, Parag Bhamare, Amrita Mahale, Saurabh Rane, Neeraj Agarwal, Rahul Panicker

Testing capacity for COVID-19 remains a challenge globally due to the lack of adequate supplies, trained personnel, and sample-processing equipment.

Paper
Code

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

1 code implementation • 1 Nov 2021 • Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.

Action Recognition Language Modelling

Paper
Code

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.

Speaker Diarization Speaker Recognition +1

Paper
Code

A CLIP-Hitchhiker's Guide to Long Video Retrieval

1 code implementation • 17 May 2022 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our goal in this paper is the adaptation of image-text models for long video retrieval.

Ranked #4 on Zero-Shot Action Recognition on Charades

Retrieval Video Retrieval +1

Paper
Code

Learnable PINs: Cross-Modal Embeddings for Person Identity

1 code implementation • ECCV 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman

We propose and investigate an identity sensitive joint embedding of face and voice.

Cross-Modal Retrieval Retrieval

Paper
Code

Seeing Voices and Hearing Faces: Cross-modal biometric matching

no code implementations • CVPR 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman

We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.

Face Recognition Speaker Identification

Paper
Add Code

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script

no code implementations • 31 Jan 2018 • Arsha Nagrani, Andrew Zisserman

The goal of this paper is the automatic identification of characters in TV and feature film material.

Speaker Identification

Paper
Add Code

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

no code implementations • 16 Aug 2018 • Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.

Ranked #3 on Facial Expression Recognition (FER) on FERPlus

Facial Emotion Recognition Facial Expression Recognition (FER) +1

Paper
Add Code

Count, Crop and Recognise: Fine-Grained Recognition in the Wild

no code implementations • 19 Sep 2019 • Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman

The goal of this paper is to label all the animal individuals present in every frame of a video.

Paper
Add Code

WiCV 2019: The Sixth Women In Computer Vision Workshop

no code implementations • 23 Sep 2019 • Irene Amerini, Elena Balashova, Sayna Ebrahimi, Kathryn Leonard, Arsha Nagrani, Amaia Salvador

In this paper we present the Women in Computer Vision Workshop - WiCV 2019, organized in conjunction with CVPR 2019.

Paper
Add Code

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data.

Speaker Recognition

Paper
Add Code

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning Speaker Recognition

Paper
Add Code

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Paper
Add Code

Spot the conversation: speaker diarisation in the wild

no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Speaker Verification

Paper
Add Code

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid

Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.

Action Detection Action Recognition +2

Paper
Add Code

Look Before you Speak: Visually Contextualized Utterances

no code implementations • CVPR 2021 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone.

Paper
Add Code

VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge

no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.

Speaker Recognition

Paper
Add Code

WiCV 2020: The Seventh Women In Computer Vision Workshop

no code implementations • 11 Jan 2021 • Hazel Doughty, Nour Karessli, Kathryn Leonard, Boyi Li, Carianne Martinez, Azadeh Mobasher, Arsha Nagrani, Srishti Yadav

It provides a voice to a minority (female) group in computer vision community and focuses on increasingly the visibility of these researchers, both in academia and industry.

Paper
Add Code

Composable Augmentation Encoding for Video Representation Learning

no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid

We focus on contrastive methods for self-supervised video representation learning.

Action Recognition Contrastive Learning +2

Paper
Add Code

Recognizing Multimodal Entailment

no code implementations • ACL 2021 • Cesar Ilharco, Afsaneh Shirazi, Arjun Gopalan, Arsha Nagrani, Blaz Bratanic, Chris Bregler, Christina Funk, Felipe Ferreira, Gabriel Barcik, Gabriel Ilharco, Georg Osang, Jannis Bulian, Jared Frank, Lucas Smaira, Qin Cao, Ricardo Marino, Roma Patel, Thomas Leung, Vaiva Imbrasaite

How information is created, shared and consumed has changed rapidly in recent decades, in part thanks to new social platforms and technologies on the web.

Graph Learning Question Answering

Paper
Add Code

Masking Modalities for Cross-modal Video Retrieval

no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Paper
Add Code

Audio-Visual Synchronisation in the wild

no code implementations • 8 Dec 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.

Lip Reading

Paper
Add Code

End-to-end Generative Pretraining for Multimodal Video Captioning

no code implementations • CVPR 2022 • Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid

Recent video and language pretraining frameworks lack the ability to generate sentences.

Ranked #13 on Video Captioning on MSR-VTT (using extra training data)

Action Classification Retrieval +4

Paper
Add Code

Learning Audio-Video Modalities from Image Captions

no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Ranked #6 on Zero-shot Text to Audio Retrieval on AudioCaps

Image Captioning Retrieval +4

Paper
Add Code

M&M Mix: A Multimodal Multiview Transformer Ensemble

no code implementations • 20 Jun 2022 • Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.

Ranked #2 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Recognition Video Recognition

Paper
Add Code

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

no code implementations • 14 Aug 2022 • Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.

Video Summarization

Paper
Add Code

AVATAR submission to the Ego4D AV Transcription Challenge

no code implementations • 18 Nov 2022 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022.

Paper
Add Code

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

no code implementations • CVPR 2023 • Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

(ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech).

Automatic Speech Recognition Domain Adaptation +2

Paper
Add Code

VicTR: Video-conditioned Text Representations for Activity Recognition

no code implementations • 5 Apr 2023 • Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

In this paper, we argue the contrary, that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.

Ranked #8 on Action Classification on Charades

Action Classification Activity Recognition +1

Paper
Add Code

LanSER: Language-Model Supported Speech Emotion Recognition

no code implementations • 7 Sep 2023 • Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou

Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult.

Automatic Speech Recognition Language Modelling +5

Paper
Add Code

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description

no code implementations • ICCV 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gul Varol, Weidi Xie, Andrew Zisserman

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.

Language Modelling Text Generation

Paper
Add Code

AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description

no code implementations • 10 Oct 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.

Language Modelling Text Generation

Paper
Add Code

VidChapters-7M: Video Chapters at Scale

no code implementations • NeurIPS 2023 • Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.

Dense Video Captioning Navigate

Paper
Add Code

Video Summarization: Towards Entity-Aware Captions

no code implementations • 1 Dec 2023 • Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task.

Image Captioning Video Captioning +2

Paper
Add Code

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

no code implementations • 9 Apr 2024 • Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.

Question Answering Video Question Answering

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.