no code implementations • ECCV 2020 • Yang Liu, Qingchao Chen, Andrew Zisserman
In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object.
no code implementations • 29 Nov 2024 • Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
We summarise in this report the challenge tasks and results, and introduce in detail the novel hour-long video QA benchmark 1h-walk VQA.
no code implementations • 18 Nov 2024 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids.
2 code implementations • 13 Nov 2024 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
We discuss some consistent issues on how RepNet has been evaluated in various papers.
1 code implementation • 22 Oct 2024 • Robin Y. Park, Rhydian Windsor, Amir Jamaludin, Andrew Zisserman
We propose a general pipeline to automate the extraction of labels from radiology reports using large language models, which we validate on spinal MRI reports.
no code implementations • 15 Oct 2024 • Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman
We introduce two benchmarks for unique captioning, based on egocentric footage and time-loop movies, where repeating actions are common.
no code implementations • 14 Oct 2024 • Jaesung Huh, Andrew Zisserman
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
no code implementations • 27 Aug 2024 • Jaesung Huh, Joon Son Chung, Arsha Nagrani, Andrew Brown, Jee-weon Jung, Daniel Garcia-Romero, Andrew Zisserman
In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation.
no code implementations • 19 Aug 2024 • Yash Bhalgat, Vadim Tschernezki, Iro Laina, João F. Henriques, Andrea Vedaldi, Andrew Zisserman
Egocentric videos present unique challenges for 3D scene understanding due to rapid camera motion, frequent object occlusions, and limited object visibility.
1 code implementation • 1 Aug 2024 • Ragav Sachdeva, Gyungin Shin, Andrew Zisserman
Enabling engagement with manga by visually impaired individuals presents a significant challenge due to its inherently visual nature.
no code implementations • 24 Jul 2024 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count.
1 code implementation • 22 Jul 2024 • Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner.
2 code implementations • 8 Jul 2024 • Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch
We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D).
2 code implementations • 5 Jul 2024 • Niki Amini-Naieni, Tengda Han, Andrew Zisserman
The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images.
Ranked #1 on Object Counting on FSC147
1 code implementation • CVPR 2024 • Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
Ranked #1 on Speech Prompted Semantic Segmentation on ADE20K
no code implementations • 16 May 2024 • Charles Raude, K R Prajwal, Liliane Momeni, Hannah Bull, Samuel Albanie, Andrew Zisserman, Gül Varol
To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
no code implementations • 25 Apr 2024 • Charig Yang, Weidi Xie, Andrew Zisserman
We also introduce a transformer-based model for ordering of image sequences of arbitrary length with built-in attribution maps.
no code implementations • 22 Apr 2024 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names.
1 code implementation • 18 Apr 2024 • Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video.
1 code implementation • CVPR 2024 • Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
no code implementations • 18 Mar 2024 • Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar
The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions.
no code implementations • 16 Mar 2024 • Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi
To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities.
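As a rough illustration of the nesting idea (the shapes, prefix lengths, and loss below are assumptions for this sketch, not the authors' code), coarse scene properties can be supervised on a short prefix of the feature vector and finer ones on progressively longer prefixes, so a single field serves several granularities:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: one high-dimensional feature field whose leading
# dimensions are supervised at coarse granularity and whose full vector
# is supervised at fine granularity (the nesting idea behind N2F2).
D = 256                      # full feature dimension (assumed)
PREFIXES = [64, 128, 256]    # nested granularities: coarse -> fine (assumed)

def nested_loss(pred, targets):
    """pred: (N, D) field outputs; targets: list of (N, d) teacher features,
    one per granularity, with d matching each prefix length."""
    loss = 0.0
    for d, target in zip(PREFIXES, targets):
        # Supervise only the first d dimensions at this granularity.
        loss = loss + F.mse_loss(F.normalize(pred[:, :d], dim=-1),
                                 F.normalize(target, dim=-1))
    return loss

pred = torch.randn(1024, D, requires_grad=True)
targets = [torch.randn(1024, d) for d in PREFIXES]
nested_loss(pred, targets).backward()
```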
no code implementations • 29 Feb 2024 • Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke
Furthermore, we show that using the same prompts, we can successfully employ LLMs to improve the retrieval on EpicSounds, compared to using the original audio class labels of the dataset.
2 code implementations • 1 Feb 2024 • Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes.
Ranked #1 on Point Tracking on TAP-Vid-RGB-Stacking
2 code implementations • 29 Jan 2024 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse.
no code implementations • 22 Jan 2024 • Bruno Korbar, Jaesung Huh, Andrew Zisserman
The goal of this paper is automatic character-aware subtitle generation.
1 code implementation • CVPR 2024 • Ragav Sachdeva, Andrew Zisserman
In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation.
no code implementations • CVPR 2024 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names.
1 code implementation • CVPR 2024 • Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images.
no code implementations • 20 Dec 2023 • Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark.
no code implementations • 19 Dec 2023 • Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task.
Ranked #15 on Video Question Answering on NExT-QA
no code implementations • 18 Dec 2023 • Junyu Xie, Weidi Xie, Andrew Zisserman
The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes.
no code implementations • CVPR 2024 • Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzadeh
Understanding long, real-world videos requires modeling of long-range visual dependencies.
no code implementations • CVPR 2024 • João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman
We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling.
no code implementations • 15 Nov 2023 • Amir Jamaludin, Timor Kadir, Emma Clark, Andrew Zisserman
Our objective in this paper is to estimate spine curvature in DXA scans.
no code implementations • 25 Oct 2023 • Jianbo Jiao, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, Andrew Zisserman, J. Alison Noble
Auditory and visual signals are usually present together and correlate with each other, not only in natural environments but also in clinical settings.
no code implementations • 10 Oct 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.
1 code implementation • 10 Oct 2023 • Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
To this end, we make the following contributions: (i) We introduce a general and lightweight protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene, by training discriminative classifiers on the features for these properties.
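A minimal sketch of such a probing protocol, assuming frozen features and a hypothetical binary property label; the probe is a simple linear classifier trained on top:

```python
import torch
import torch.nn as nn

# Sketch of the probing protocol: freeze a large vision model's features and
# train a small discriminative classifier per physical property (e.g. a
# binary "A occludes B" probe). Shapes and the property head are assumptions.
feats = torch.randn(5000, 768)          # frozen features for 5000 examples
labels = torch.randint(0, 2, (5000,))   # 1 if the property holds, else 0

probe = nn.Linear(768, 2)               # lightweight classifier on frozen features
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(feats), labels)
    loss.backward()
    opt.step()

accuracy = (probe(feats).argmax(dim=1) == labels).float().mean()
```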
1 code implementation • 8 Oct 2023 • Sindhu B Hegde, Andrew Zisserman
In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
Ranked #1 on Active Speaker Detection on LRS3-TED
no code implementations • ICCV 2023 • Hala Lamdouar, Weidi Xie, Andrew Zisserman
We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner.
1 code implementation • 21 Aug 2023 • Ragav Sachdeva, Andrew Zisserman
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene acquired from different camera positions and at different temporal instances.
1 code implementation • ICCV 2023 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D).
1 code implementation • 18 Jul 2023 • Jaesung Huh, Max Bain, Andrew Zisserman
This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team.
3 code implementations • ICCV 2023 • Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Ranked #1 on Visual Tracking on Kinetics
no code implementations • 8 Jun 2023 • Prannay Kaul, Weidi Xie, Andrew Zisserman
The goal of this paper is open-vocabulary object detection (OVOD) -- building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining.
1 code implementation • NeurIPS 2023 • Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well as on our newly created Messy Rooms dataset, demonstrating the effectiveness and scalability of our slow-fast clustering method.
1 code implementation • 2 Jun 2023 • Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, Andrew Zisserman
Our objective is open-world object counting in images, where the target object class is specified by a text description.
Ranked #4 on Zero-Shot Counting on FSC147
1 code implementation • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
Ranked #1 on Point Tracking on Perception Test
1 code implementation • ICCV 2023 • Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
Ranked #21 on Zero-Shot Video Question Answer on NExT-QA
no code implementations • 30 Mar 2023 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
This paper explores training medical vision-language models (VLMs) -- where the visual and language inputs are embedded into a common space -- with a particular focus on scenarios where training data is limited, as is often the case in clinical datasets.
1 code implementation • CVPR 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
no code implementations • 23 Mar 2023 • Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes.
1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.
1 code implementation • 1 Feb 2023 • Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
We introduce Epic-Sounds, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos.
1 code implementation • 23 Jan 2023 • Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering.
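A minimal sketch of this concatenate-and-fuse pattern (dimensions, token counts, and the modality embeddings are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of "concatenate-and-fuse": tokens from each modality get a learned
# modality embedding, are concatenated, and pass through one shared
# transformer backbone, with no bespoke fusion module.
d = 512
video_tokens = torch.randn(2, 196, d)   # batch of video patch tokens (assumed)
audio_tokens = torch.randn(2, 64, d)    # batch of spectrogram tokens (assumed)

mod_emb = nn.Parameter(torch.randn(2, 1, 1, d))  # one embedding per modality
tokens = torch.cat([video_tokens + mod_emb[0],
                    audio_tokens + mod_emb[1]], dim=1)

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=4)
fused = backbone(tokens)                # (2, 196 + 64, d)
```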
no code implementations • ICCV 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gul Varol, Weidi Xie, Andrew Zisserman
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences.
no code implementations • CVPR 2023 • Yash Bhalgat, Joao F. Henriques, Andrew Zisserman
Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors.
1 code implementation • 16 Nov 2022 • K R Prajwal, Hannah Bull, Liliane Momeni, Samuel Albanie, Gül Varol, Andrew Zisserman
Through extensive evaluations, we verify our method for automatic annotation and our model architecture.
3 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
no code implementations • 26 Oct 2022 • Bruno Korbar, Andrew Zisserman
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
1 code implementation • 18 Oct 2022 • Guanqi Zhan, Weidi Xie, Andrew Zisserman
To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects.
Ranked #1 on Instance Segmentation on Separated COCO
2 code implementations • 13 Oct 2022 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
no code implementations • 10 Oct 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is an efficient training method for video tasks.
no code implementations • 6 Oct 2022 • Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski
In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
no code implementations • 28 Sep 2022 • Ragav Sachdeva, Andrew Zisserman
We live in a dynamic world where things change all the time.
1 code implementation • 29 Aug 2022 • Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting.
Ranked #3 on Exemplar-Free Counting on FSC147
no code implementations • 4 Aug 2022 • Liliane Momeni, Hannah Bull, K R Prajwal, Samuel Albanie, Gül Varol, Andrew Zisserman
Recently, sign language researchers have turned to sign language interpreted TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles corresponding to the audio content, as a readily available and large-scale source of training data.
no code implementations • 20 Jul 2022 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip.
1 code implementation • 5 Jul 2022 • Junyu Xie, Weidi Xie, Andrew Zisserman
The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video.
Ranked #3 on Unsupervised Object Segmentation on FBMS-59
no code implementations • 27 Jun 2022 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
(iii) We also apply SCT to an existing problem: radiological grading of inter-vertebral discs (IVDs) in lumbar MR scans for common degenerative changes. We show that by considering the context of vertebral bodies in the image, SCT improves the accuracy for several gradings compared to previously published models.
1 code implementation • 17 May 2022 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our goal in this paper is the adaptation of image-text models for long video retrieval.
Ranked #4 on Zero-Shot Action Recognition on Charades
no code implementations • 9 May 2022 • Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
no code implementations • 3 May 2022 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
This technical report presents SpineNetV2, an automated tool which: (i) detects and labels vertebral bodies in clinical spinal magnetic resonance (MR) scans across a range of commonly used sequences; and (ii) performs radiological grading of lumbar intervertebral discs in T2-weighted scans for a range of common degenerative changes.
5 code implementations • DeepMind 2022 • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Ranked #1 on Action Recognition on RareAct
1 code implementation • CVPR 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
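One way to picture the two decisions, as a toy similarity-based sketch rather than the paper's trained network (the threshold and feature shapes are assumptions):

```python
import torch
import torch.nn.functional as F

# Toy version of the two alignment decisions: score each sentence against
# every video segment; if the best similarity clears a threshold the sentence
# is deemed alignable, and the argmax gives its temporal alignment.
video_feats = F.normalize(torch.randn(500, 256), dim=-1)   # per-segment features
text_feats = F.normalize(torch.randn(10, 256), dim=-1)     # per-sentence features

sim = text_feats @ video_feats.T          # (10, 500) cosine similarities
best, idx = sim.max(dim=1)
THRESH = 0.5                              # assumed alignability threshold
for s, (score, seg) in enumerate(zip(best, idx)):
    if score > THRESH:
        print(f"sentence {s} aligns to segment {seg.item()} "
              f"(sim={score.item():.2f})")
    else:
        print(f"sentence {s} judged not alignable")
```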
1 code implementation • 16 Mar 2022 • Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, Relja Arandjelović
The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks.
2 code implementations • 22 Feb 2022 • Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle
This however hinders them from scaling up to the input sizes required to process raw high-resolution images or video.
1 code implementation • CVPR 2022 • Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
Here, the unlabelled images may come from labelled classes or from novel ones.
Ranked #1 on Open-World Semi-Supervised Learning on CIFAR-10 (Seen accuracy (50% Labeled) metric)
no code implementations • CVPR 2022 • Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities.
2 code implementations • 13 Dec 2021 • Michael P. J. Camilleri, Li Zhang, Rasneer S. Bains, Andrew Zisserman, Christopher K. I. Williams
Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research.
1 code implementation • CVPR 2022 • Prannay Kaul, Weidi Xie, Andrew Zisserman
The objective of this paper is few-shot object detection (FSOD) -- the task of expanding an object detector for a new category given only a few instances for training.
no code implementations • 8 Dec 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
no code implementations • CVPR 2022 • Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.
no code implementations • CVPR 2022 • Charig Yang, Weidi Xie, Andrew Zisserman
In this paper, we present a framework for reading analog clocks in natural images or videos.
no code implementations • 5 Nov 2021 • Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, Andrew Zisserman
In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL).
1 code implementation • 1 Nov 2021 • Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.
1 code implementation • 29 Oct 2021 • K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.
Ranked #1 on Visual Keyword Spotting on LRS2
no code implementations • CVPR 2022 • K R Prajwal, Triantafyllos Afouras, Andrew Zisserman
To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network.
Ranked #1 on Visual Speech Recognition on LRS2 (using extra training data)
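A minimal sketch of the attention-based pooling in contribution (1) above, using a learned query over per-frame features (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch of attention-based pooling: a learned query attends over the
# per-frame visual speech features and the weighted sum replaces naive
# average pooling. Dimensions and sequence length are assumptions.
d = 512
frames = torch.randn(4, 75, d)              # (batch, time, feature)

query = nn.Parameter(torch.randn(1, 1, d))
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

q = query.expand(frames.size(0), -1, -1)    # one query per sequence
pooled, weights = attn(q, frames, frames)   # (4, 1, d) attention-weighted summary
pooled = pooled.squeeze(1)
```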
2 code implementations • ICLR 2022 • Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
In this paper, we first demonstrate that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
Ranked #10 on Out-of-Distribution Detection on CIFAR-100 vs CIFAR-10
1 code implementation • NeurIPS Workshop ImageNet_PPF 2021 • Yuki M. Asano, Christian Rupprecht, Andrew Zisserman, Andrea Vedaldi
On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining.
8 code implementations • ICLR 2022 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible.
Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)
1 code implementation • 14 Jul 2021 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
This paper explores the use of self-supervised deep learning in medical imaging in cases where two scan modalities are available for the same subject.
1 code implementation • 29 Jun 2021 • Kai Han, Sylvestre-Alvise Rebuffi, Sébastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.
Ranked #1 on Novel Class Discovery on SVHN
no code implementations • 9 Jun 2021 • Relja Arandjelović, Andrew Zisserman
In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it is based on a heuristic and not trained end-to-end for the task at hand.
no code implementations • 20 May 2021 • Andrew Brown, Vicky Kalogeiton, Andrew Zisserman
In this paper we make contributions to address both these deficiencies: first, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice).
no code implementations • 20 May 2021 • Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
This is a short note on the performance of the ALI-G algorithm (Berrada et al., 2020) as reported in (Loizou et al., 2021).
no code implementations • CVPR 2021 • Erika Lu, Forrester Cole, Tali Dekel, Andrew Zisserman, William T. Freeman, Michael Rubinstein
We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.
no code implementations • ICCV 2021 • Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, Andrew Zisserman
The goal of this work is to temporally align asynchronous subtitles in sign language videos.
4 code implementations • ICCV 2021 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
On semi-supervised learning benchmarks we improve performance significantly when only 1% of ImageNet labels are available, from 53.8% to 56.5%.
Ranked #1 on Image Classification on PASCAL VOC 2007
no code implementations • CVPR 2021 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query.
Ranked #13 on Action Recognition on Diving-48
1 code implementation • ICCV 2021 • Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, Yang Liu
In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders.
no code implementations • ICCV 2021 • Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, Weidi Xie
We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.
Ranked #7 on Unsupervised Object Segmentation on DAVIS 2016
1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
5 code implementations • ICCV 2021 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
Ranked #4 on Video Retrieval on QuerYD (using extra training data)
no code implementations • CVPR 2021 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
no code implementations • CVPR 2021 • Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.
1 code implementation • ICCV 2021 • Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman
Most successful self-supervised learning methods are trained to align the representations of two independent views from the data.
2 code implementations • 5 Mar 2021 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs.
Ranked #1 on Human Interaction Recognition on EPIC-SOUNDS
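A toy two-stream layout on spectrogram inputs, assuming illustrative shapes and simple late fusion by averaging (not the paper's exact architecture):

```python
import torch
import torch.nn as nn

# Sketch of a two-stream convolutional audio model: two CNN streams consume
# time-frequency spectrograms and their logits are fused by averaging.
# The stream design, input shapes and class count are assumptions.
def make_stream(n_classes=44):
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(64, n_classes))

stream_a, stream_b = make_stream(), make_stream()
spec_a = torch.randn(8, 1, 128, 256)   # spectrogram input for stream A
spec_b = torch.randn(8, 1, 128, 256)   # spectrogram input for stream B
logits = (stream_a(spec_a) + stream_b(spec_b)) / 2   # late fusion by averaging
```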
11 code implementations • 4 Mar 2021 • Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
Ranked #35 on 3D Point Cloud Classification on ModelNet40 (Mean Accuracy metric)
no code implementations • 10 Feb 2021 • Andrew Brown, Ernesto Coto, Andrew Zisserman
We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio).
1 code implementation • 6 Jan 2021 • Manuel J. Marin-Jimenez, Vicky Kalogeiton, Pablo Medina-Suarez, Andrew Zisserman
For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos.
no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
no code implementations • 23 Nov 2020 • Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman
We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consecutive frames based on the background, which effectively emphasises the object boundary in the difference image, and a motion segmentation module with memory that discovers the moving objects, while maintaining the object permanence even when motion is absent at some point.
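A classical, non-learned approximation of the registration-then-difference idea, using OpenCV feature tracks and a RANSAC homography (the paper's module is differentiable and trained end-to-end; this is only an illustration of the principle):

```python
import cv2
import numpy as np

# Align frame t+1 to frame t with a background homography so that, after
# warping, the difference image highlights the independently moving
# (camouflaged) object. RANSAC tends to reject the object's own tracks.
def motion_difference(frame_t, frame_t1):
    g0 = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_t1, cv2.COLOR_BGR2GRAY)
    pts0 = cv2.goodFeaturesToTrack(g0, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(g0, g1, pts0, None)
    ok = status.ravel() == 1
    # Homography fit to the (mostly background) tracks.
    H, _ = cv2.findHomography(pts1[ok], pts0[ok], cv2.RANSAC, 3.0)
    aligned = cv2.warpPerspective(g1, H, (g0.shape[1], g0.shape[0]))
    return cv2.absdiff(g0, aligned)  # bright where motion departs from background
```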
2 code implementations • 22 Nov 2020 • Andreea-Maria Oncescu, João F. Henriques, Yang Liu, Andrew Zisserman, Samuel Albanie
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
no code implementations • 21 Oct 2020 • Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset.
1 code implementation • NeurIPS 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is visual-only self-supervised video representation learning.
Ranked #8 on Self-Supervised Action Recognition Linear on HMDB51
1 code implementation • 8 Oct 2020 • Liliane Momeni, Gül Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
1 code implementation • 16 Sep 2020 • Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein
We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur.
no code implementations • ECCV 2020 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents.
1 code implementation • 2 Sep 2020 • Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman
The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.
no code implementations • 1 Sep 2020 • Weidi Xie, Jeffrey Byrne, Andrew Zisserman
We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high quality (unblurred, good lighting, more frontal) faces from a collection, e.g. for automatic enrolment or display.
1 code implementation • ECCV 2020 • Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".
1 code implementation • ECCV 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning from video, in particular for representations for action recognition.
2 code implementations • ECCV 2020 • Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods.
Ranked #4 on Vehicle Re-Identification on VehicleID Medium
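One standard way to make AP trainable, sketched below, is to relax the ranking indicator with a temperature-controlled sigmoid; the temperature and the simplified single-query setup here are assumptions:

```python
import torch

# The non-differentiable indicator 1[s_j > s_i] in the ranking is replaced
# by a sigmoid, making Average Precision amenable to gradient descent.
def smooth_ap(scores, labels, tau=0.01):
    """scores: (N,) similarities to the query; labels: (N,) 1=positive."""
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)    # diff[i, j] = s_j - s_i
    relaxed = torch.sigmoid(diff / tau)                 # soft "j ranks above i"
    relaxed = relaxed * (1 - torch.eye(len(scores)))    # no self-comparison
    pos = labels.bool()
    # Soft rank of each positive among all items, and among positives only.
    rank_all = 1 + relaxed[pos].sum(dim=1)
    rank_pos = 1 + relaxed[pos][:, pos].sum(dim=1)
    return (rank_pos / rank_all).mean()                 # differentiable AP

scores = torch.randn(100, requires_grad=True)
labels = (torch.rand(100) < 0.1).long()
loss = 1 - smooth_ap(scores, labels)
loss.backward()
```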
1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman
Recent progress in fine-grained gesture and action classification, and machine translation, points to the possibility of automated sign language recognition becoming a reality.
Ranked #6 on Sign Language Recognition on WLASL-2000
5 code implementations • NeurIPS 2020 • Carl Doersch, Ankush Gupta, Andrew Zisserman
In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains.
no code implementations • CVPR 2021 • Olivia Wiles, Sebastien Ehrhardt, Andrew Zisserman
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
no code implementations • 6 Jul 2020 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
We propose a novel convolutional method for the detection and identification of vertebrae in whole spine MRIs.
no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
1 code implementation • NeurIPS 2020 • Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
2 code implementations • CVPR 2020 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
We present an approach for estimating the period with which an action is repeated in a video.
Ranked #1 on Repetitive Action Counting on Countix
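At the heart of this style of repetition counting is a temporal self-similarity matrix over per-frame embeddings, in which periodic motion shows up as regular off-diagonal bands; a minimal sketch (the embedding source and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

# Per-frame embeddings are compared pairwise, and a softmax over negative
# squared distances yields a matrix whose diagonal stripes reveal the period.
def self_similarity(embs, temperature=13.5):
    """embs: (T, D) per-frame embeddings -> (T, T) self-similarity matrix."""
    dists = torch.cdist(embs, embs).pow(2)          # pairwise squared distances
    return F.softmax(-dists / temperature, dim=-1)  # row-normalised similarity

embs = torch.randn(64, 512)      # e.g. features for 64 video frames (assumed)
tsm = self_similarity(embs)      # input to a downstream period predictor
```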
1 code implementation • 17 Jun 2020 • Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
1 code implementation • 8 May 2020 • Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
Our objective in this work is long range understanding of the narrative structure of movies.
no code implementations • 1 May 2020 • Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips.
3 code implementations • 29 Apr 2020 • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques.
no code implementations • 13 Apr 2020 • Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi
Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision.
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
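A hedged sketch of such a text-to-action classifier using off-the-shelf BERT fine-tuning machinery; the label set and checkpoint are illustrative, not the paper's configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Fine-tune a BERT classifier to map transcribed speech to an action label.
ACTIONS = ["open door", "drive", "phone", "run", "eat"]   # assumed label set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ACTIONS))
# The classification head is randomly initialised here; in practice it would
# first be trained on (speech segment, action label) pairs from screenplays.

batch = tok(["Get in, I'll drive you there."], return_tensors="pt",
            truncation=True, padding=True)
logits = model(**batch).logits
print(ACTIONS[logits.argmax(dim=-1).item()])   # predicted action for the line
```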
no code implementations • 26 Mar 2020 • Yujie Zhong, Relja Arandjelović, Andrew Zisserman
The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors.
1 code implementation • CVPR 2020 • Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman
Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.
no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
The objective of this paper is to learn representations of speaker identity without access to manually annotated data.
1 code implementation • ICLR 2020 • Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.
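A compact sketch of idea (2), pairwise pseudo-labels from rank statistics: two unlabelled images count as the same class when the top-k dimensions of their representations agree (k is an assumed hyperparameter):

```python
import torch

# Two unlabelled images are pseudo-labelled as the same class when the sets
# of their top-k representation dimensions (by magnitude) are identical.
def rank_stat_pairs(feats, k=5):
    """feats: (N, D) representations -> (N, N) boolean pairwise pseudo-labels."""
    topk = feats.abs().topk(k, dim=1).indices           # (N, k) top dimensions
    # same[i, j] is True iff every top-k dimension of i appears in j's top-k.
    same = (topk.unsqueeze(1).unsqueeze(-1) ==
            topk.unsqueeze(0).unsqueeze(-2)).any(-1).all(-1)
    return same

feats = torch.randn(16, 128)
pairs = rank_stat_pairs(feats)   # targets for a pairwise clustering loss
```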
4 code implementations • CVPR 2020 • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and not scalable.
Ranked #3 on Action Recognition on RareAct
1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data.
no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.
Ranked #19 on Lipreading on LRS3-TED (using extra training data)
no code implementations • 28 Oct 2019 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information.
no code implementations • ICCV 2019 • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.
no code implementations • 19 Sep 2019 • Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman
The goal of this paper is to label all the animal individuals present in every frame of a video.
1 code implementation • 10 Sep 2019 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
Ranked #10 on Self-Supervised Action Recognition Linear on UCF101
no code implementations • 6 Sep 2019 • Dan Xu, Weidi Xie, Andrew Zisserman
In this paper we propose a geometry-aware model for video object detection.
1 code implementation • ICCV 2019 • Kai Han, Andrea Vedaldi, Andrew Zisserman
The second contribution is a method to estimate the number of classes in the unlabelled data.