7 code implementations • ICLR 2022 • Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible.
Ranked #1 on Optical Flow Estimation on KITTI 2015 (Average End-Point Error metric)
299 code implementations • 4 Sep 2014 • Karen Simonyan, Andrew Zisserman
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.
Ranked #2 on Classification on InDL
3 code implementations • CVPR 2016 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
In this paper we introduce a new method for text detection in natural images.
Ranked #12 on Scene Text Detection on ICDAR 2013
2 code implementations • CVPR 2019 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
We introduce a self-supervised representation learning method based on the task of temporal alignment between videos.
Ranked #1 on Video Alignment on UPenn Action
23 code implementations • 23 Oct 2017 • Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, Andrew Zisserman
The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.
Ranked #1 on Face Verification on IJB-C (training dataset metric)
1 code implementation • NeurIPS 2020 • Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
10 code implementations • 4 Mar 2021 • Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
Ranked #29 on Audio Classification on AudioSet
6 code implementations • NeurIPS 2019 • Tejas Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih
In this work we aim to learn object representations that are useful for control and reinforcement learning (RL).
1 code implementation • 18 Jul 2023 • Jaesung Huh, Max Bain, Andrew Zisserman
This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team.
45 code implementations • NeurIPS 2015 • Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner.
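The core mechanism of this paper (Spatial Transformer Networks) is a differentiable sampling step: a predicted affine transform warps a normalised sampling grid, and the input is bilinearly interpolated at the warped positions, letting the network learn to undo nuisance geometric transformations. A minimal numpy sketch of that sampling step, not the paper's implementation (function name and test values are illustrative):

```python
import numpy as np

def affine_grid_sample(img, theta):
    # Build a normalised [-1, 1] sampling grid over the output image.
    H, W = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    # Transform the grid by the 2x3 affine matrix (the learned part).
    x_src, y_src = theta @ coords
    # Map normalised coordinates back to pixel indices.
    xi = (x_src + 1) * (W - 1) / 2
    yi = (y_src + 1) * (H - 1) / 2
    x0, y0 = np.floor(xi).astype(int), np.floor(yi).astype(int)
    wx, wy = xi - x0, yi - y0
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x0 + 1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y0 + 1, 0, H - 1)
    # Bilinear interpolation of the four neighbours (smooth in theta).
    out = (img[y0c, x0c] * (1 - wx) * (1 - wy) + img[y0c, x1c] * wx * (1 - wy)
           + img[y1c, x0c] * (1 - wx) * wy + img[y1c, x1c] * wx * wy)
    return out.reshape(H, W)

img = np.arange(16, dtype=float).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
warped = affine_grid_sample(img, identity)  # identity transform leaves img intact
```

Because every operation above is smooth in `theta`, gradients can flow back into the module that predicts the transform.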
21 code implementations • 20 Dec 2013 • Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets).
33 code implementations • CVPR 2017 • Joao Carreira, Andrew Zisserman
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.
4 code implementations • DeepMind 2022 • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Ranked #1 on Action Recognition on RareAct
5 code implementations • ICCV 2021 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
Ranked #4 on Video Retrieval on QuerYD (using extra training data)
1 code implementation • ICCV 2023 • Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time.
Ranked #11 on Zero-Shot Video Question Answer on NExT-QA
4 code implementations • ICCV 2021 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
On semi-supervised learning benchmarks we improve performance significantly when only 1% of ImageNet labels are available, from 53.8% to 56.5%.
Ranked #1 on Image Classification on PASCAL VOC 2007
12 code implementations • 19 May 2017 • Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman
We describe the DeepMind Kinetics human action video dataset.
3 code implementations • 7 Nov 2022 • Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move.
1 code implementation • ICCV 2023 • Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Ranked #1 on Visual Tracking on Kinetics
2 code implementations • 1 Feb 2024 • Carl Doersch, Yi Yang, Dilara Gokay, Pauline Luc, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ross Goroshin, João Carreira, Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes.
4 code implementations • NeurIPS 2018 • Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell
We present an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage, and demonstrate that our learning method allows agents to learn to navigate multiple cities and to traverse to target destinations that may be kilometres away.
4 code implementations • NeurIPS 2020 • Carl Doersch, Ankush Gupta, Andrew Zisserman
In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains.
1 code implementation • CVPR 2016 • Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.
Ranked #60 on Action Recognition on UCF101 (using extra training data)
Action Recognition In Videos · Temporal Action Localization · +1
7 code implementations • NeurIPS 2014 • Karen Simonyan, Andrew Zisserman
Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art.
3 code implementations • ICCV 2017 • Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year.
4 code implementations • 30 May 2019 • Simon A. A. Kohl, Bernardino Romera-Paredes, Klaus H. Maier-Hein, Danilo Jimenez Rezende, S. M. Ali Eslami, Pushmeet Kohli, Andrew Zisserman, Olaf Ronneberger
Medical imaging only indirectly measures the molecular identity of the tissue within each voxel, which often produces only ambiguous image evidence for target measures of interest, like semantic segmentation.
1 code implementation • CVPR 2012 • Relja Arandjelović, Andrew Zisserman
The objective of this work is object retrieval in large scale image datasets, where the object is specified by an image query and retrieval should be immediate at run time in the manner of Video Google [28].
Ranked #6 on Image Matching on IMC PhotoTourism (using extra training data)
8 code implementations • Interspeech 2018 • Arsha Nagrani, Joon Son Chung, Andrew Zisserman
Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance.
Sound
9 code implementations • 26 Feb 2019 • Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.
2 code implementations • 14 Jun 2018 • Joon Son Chung, Arsha Nagrani, Andrew Zisserman
The objective of this paper is speaker recognition under noisy and unconstrained conditions.
Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)
3 code implementations • 31 Jul 2019 • Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman
The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.
Ranked #23 on Video Retrieval on MSVD
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
1 code implementation • ICCV 2021 • Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, Yang Liu
In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders.
1 code implementation • NeurIPS 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is visual-only self-supervised video representation learning.
Ranked #12 on Self-Supervised Action Recognition on HMDB51 (finetuned)
1 code implementation • 4 Mar 2019 • Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell
However, these datasets cannot be used for decision-making and reinforcement learning, and in general the perspective of navigation as an interactive learning task, where the actions and behaviours of a learning agent are learned simultaneously with perception and planning, is relatively unsupported.
1 code implementation • NeurIPS Workshop ImageNet_PPF 2021 • Yuki M. Asano, Christian Rupprecht, Andrew Zisserman, Andrea Vedaldi
On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining.
1 code implementation • 10 Sep 2019 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.
Ranked #33 on Self-Supervised Action Recognition on UCF101
Representation Learning · Self-Supervised Action Recognition · +2
2 code implementations • ICLR 2022 • Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
In this paper, we first demonstrate that the ability of a classifier to make the 'none-of-the-above' decision is highly correlated with its accuracy on the closed-set classes.
Ranked #10 on Out-of-Distribution Detection on CIFAR-100 vs CIFAR-10
1 code implementation • ICLR 2018 • Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
We compare the performance of the cross-entropy loss and our margin-based losses in various regimes of noise and data size, for the predominant use case of k=5.
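The top-k setting above can be illustrated with a small sketch: a prediction only incurs loss if the true class score fails to beat the k-th largest competing score by a margin. This is a plain hinge illustration of the idea, not the smoothed loss the paper actually proposes (function name and scores are illustrative):

```python
import numpy as np

def topk_hinge_loss(scores, label, k=5, margin=1.0):
    # Scores for all classes other than the true one.
    wrong = np.delete(np.asarray(scores, dtype=float), label)
    # The k-th largest competing score: the one the true class must beat
    # for the prediction to count as top-k correct.
    kth_wrong = np.sort(wrong)[-k]
    return max(0.0, margin + kth_wrong - scores[label])

# True class comfortably inside the top 5: zero loss.
easy = topk_hinge_loss([9.0, 1, 2, 3, 4, 5, 6], label=0)
# True class pushed out of the top 5: positive (hinged) loss.
hard = topk_hinge_loss([0.0, 10, 9, 8, 7, 6, 5], label=0)
```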
2 code implementations • 7 Nov 2016 • Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
We present a novel layerwise optimization algorithm for the learning objective of Piecewise-Linear Convolutional Neural Networks (PL-CNNs), a large class of convolutional neural networks.
1 code implementation • ICLR 2020 • Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.
1 code implementation • 29 Jun 2021 • Kai Han, Sylvestre-Alvise Rebuffi, Sébastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.
Ranked #1 on Novel Class Discovery on SVHN
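Idea (2), the ranking-statistics pseudo-label, can be sketched in a few lines: two unlabelled images are treated as belonging to the same novel class when the sets of their most activated feature dimensions coincide. A minimal numpy sketch of that comparison, with illustrative (hypothetical) feature vectors:

```python
import numpy as np

def same_class_by_ranking(z1, z2, k=3):
    # Compare the indices of the k most activated feature dimensions;
    # a match is taken as a positive pairwise pseudo-label.
    top1 = set(np.argsort(z1)[-k:])
    top2 = set(np.argsort(z2)[-k:])
    return top1 == top2

a = [0.1, 0.9, 0.8, 0.2, 0.7]
b = [0.0, 0.95, 0.6, 0.1, 0.65]   # same dominant dimensions as `a`
c = [0.9, 0.1, 0.1, 0.8, 0.7]     # different dominant dimensions
pair_ab = same_class_by_ranking(a, b)   # positive pair
pair_ac = same_class_by_ranking(a, c)   # negative pair
```

These binary pairwise labels then supervise the clustering head on the unlabelled data.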
1 code implementation • CVPR 2019 • Anurag Arnab, Carl Doersch, Andrew Zisserman
We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos.
Ranked #1 on Monocular 3D Human Pose Estimation on Human3.6M (Use Video Sequence metric)
4 code implementations • CVPR 2020 • Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and not scalable.
Ranked #3 on Action Recognition on RareAct
1 code implementation • 9 Jun 2014 • Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
In this work we present a framework for the recognition of natural scene text.
Ranked #35 on Scene Text Recognition on SVT
2 code implementations • ECCV 2020 • Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging because it is non-differentiable and hence cannot be optimised directly using gradient-descent methods.
Ranked #4 on Vehicle Re-Identification on VehicleID Medium
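One standard relaxation, in the spirit of this work, replaces the non-differentiable ranking indicator 1[s_j > s_i] inside AP with a temperature-scaled sigmoid, so the whole metric admits gradients. A hedged numpy sketch (function name, temperature, and test scores are illustrative, not the paper's exact formulation):

```python
import numpy as np

def smooth_ap(scores, positives, tau=0.01):
    # Sigmoid with temperature tau: a smooth stand-in for 1[x > 0].
    sig = lambda x: 1.0 / (1.0 + np.exp(-x / tau))
    s = np.asarray(scores, dtype=float)
    pos = np.flatnonzero(positives)
    ap = 0.0
    for i in pos:
        diff = s - s[i]
        # Soft count of items ranked above item i (excluding i itself).
        rank_all = sig(diff).sum() - sig(0.0)
        rank_pos = sig(diff[pos]).sum() - sig(0.0)
        ap += (1.0 + rank_pos) / (1.0 + rank_all)
    return ap / len(pos)

# Both positives ranked first: smooth AP close to 1.
perfect = smooth_ap([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Both positives ranked last: smooth AP close to the true AP of 5/12.
worst = smooth_ap([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

As `tau` shrinks, the sigmoid approaches the hard indicator and the value converges to true AP; larger `tau` trades fidelity for smoother gradients.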
4 code implementations • 6 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
Ranked #6 on Audio-Visual Speech Recognition on LRS2
Audio-Visual Speech Recognition · Automatic Speech Recognition (ASR) · +4
1 code implementation • 18 Jan 2024 • Ragav Sachdeva, Andrew Zisserman
In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation.
1 code implementation • CVPR 2022 • Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
Here, the unlabelled images may come from labelled classes or from novel ones.
Ranked #1 on Open-World Semi-Supervised Learning on CIFAR-10 (Seen accuracy (50% Labeled) metric)
Fine-Grained Visual Recognition · Open-World Semi-Supervised Learning · +1
1 code implementation • ECCV 2020 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is self-supervised learning from video, in particular for representations for action recognition.
1 code implementation • 24 Apr 2019 • Abhishek Dutta, Andrew Zisserman
In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA).
1 code implementation • ICCV 2019 • Kai Han, Andrea Vedaldi, Andrew Zisserman
The second contribution is a method to estimate the number of classes in the unlabelled data.
1 code implementation • 8 May 2020 • Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
Our objective in this work is long range understanding of the narrative structure of movies.
2 code implementations • NeurIPS 2023 • Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4).
1 code implementation • CVPR 2023 • Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form.
1 code implementation • ECCV 2020 • Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.
1 code implementation • CVPR 2022 • Tengda Han, Weidi Xie, Andrew Zisserman
The objective of this paper is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.
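The two decisions above (is the sentence alignable at all, and if so, where) can be sketched as a cosine-similarity match between a sentence embedding and per-clip video embeddings. This is only a minimal illustration of the interface, not the paper's network (function name, threshold, and feature vectors are hypothetical):

```python
import numpy as np

def align_sentence(video_feats, sent_feat, thresh=0.5):
    # L2-normalise clip and sentence embeddings, then take cosine similarity.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    s = sent_feat / np.linalg.norm(sent_feat)
    sims = v @ s
    best = int(np.argmax(sims))
    # (1) alignable if the best match clears the threshold;
    # (2) if so, the argmax index gives the alignment.
    return bool(sims[best] >= thresh), best

clips = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.7, 0.7, 0.0]])
alignable, idx = align_sentence(clips, np.array([0.1, 0.9, 0.0]))  # matches clip 1
off_topic, _ = align_sentence(clips, np.array([0.0, 0.0, 1.0]))    # matches nothing
```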
1 code implementation • ICCV 2019 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.
Ranked #2 on Egocentric Activity Recognition on EPIC-KITCHENS-55
1 code implementation • 6 May 2019 • Ammar Abbas, Andrew Zisserman
The objective of this paper is to rectify any monocular image by computing a homography matrix that transforms it to a bird's eye (overhead) view.
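The rectification step is a standard projective warp: each image point, in homogeneous coordinates, is multiplied by the estimated 3x3 homography and the scale is divided out. A minimal numpy sketch of applying such a matrix to points (the matrix values here are illustrative, not output of the paper's estimator):

```python
import numpy as np

def warp_points(H, pts):
    # Lift (x, y) points to homogeneous coordinates, apply H, divide by w.
    pts_h = np.column_stack([pts, np.ones(len(pts))])   # (N, 3)
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

corners = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
same = warp_points(np.eye(3), corners)   # identity homography: unchanged
# A mild perspective change: points lower in the image get scaled down,
# which is the kind of foreshortening a bird's-eye rectification removes.
H_persp = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.001, 1.0]])
warped = warp_points(H_persp, corners)
```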
1 code implementation • 23 Jan 2023 • Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering.
1 code implementation • ICCV 2017 • Relja Arandjelović, Andrew Zisserman
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos?
Ranked #22 on Audio Classification on ESC-50
2 code implementations • 21 Aug 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time.
Ranked #2 on Unsupervised Facial Landmark Detection on 300W
1 code implementation • 18 Oct 2022 • Guanqi Zhan, Weidi Xie, Andrew Zisserman
To this end we make the following four contributions: (1) We propose a simple 'plugin' module for the detection head of two-stage object detectors to improve the recall of partially occluded objects.
Ranked #1 on Instance Segmentation on Separated COCO
1 code implementation • CVPR 2022 • Prannay Kaul, Weidi Xie, Andrew Zisserman
The objective of this paper is few-shot object detection (FSOD) -- the task of expanding an object detector for a new category given only a few instances for training.
1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman
Recent progress in fine-grained gesture and action classification, and machine translation, point to the possibility of automated sign language recognition becoming a reality.
Ranked #4 on Sign Language Recognition on WLASL-2000
1 code implementation • 29 Aug 2022 • Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting.
Ranked #3 on Object Counting on FSC147
1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
1 code implementation • 9 Dec 2019 • Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.
1 code implementation • 8 May 2017 • Joon Son Chung, Amir Jamaludin, Andrew Zisserman
To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.
2 code implementations • 5 Mar 2021 • Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs.
Ranked #1 on Human Interaction Recognition on EPIC-SOUNDS
1 code implementation • 1 Nov 2018 • Erika Lu, Weidi Xie, Andrew Zisserman
The model achieves competitive performance on cell and crowd counting datasets, and surpasses the state-of-the-art on the car dataset using only three training images.
1 code implementation • ICLR 2019 • Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while converging faster.
1 code implementation • 1 Feb 2023 • Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos.
1 code implementation • 2 Sep 2020 • Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman
The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.
1 code implementation • NeurIPS 2023 • Yash Bhalgat, Iro Laina, João F. Henriques, Andrew Zisserman, Andrea Vedaldi
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well as on our newly created Messy Rooms dataset, demonstrating the effectiveness and scalability of our slow-fast clustering method.
1 code implementation • ICCV 2021 • Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman
Most successful self-supervised learning methods are trained to align the representations of two independent views from the data.
1 code implementation • 17 Jun 2020 • Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
1 code implementation • CVPR 2020 • Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman
Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.
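Step (i), mapping words between languages through the shared embedding, reduces to a nearest-neighbour lookup by cosine similarity. A toy numpy sketch with hypothetical two-dimensional "shared" vectors (real embeddings would come from the trained model):

```python
import numpy as np

def translate_word(word, src_emb, tgt_emb, tgt_vocab):
    # Normalise the query and all target vectors, then pick the
    # target word with the highest cosine similarity.
    q = np.asarray(src_emb[word], dtype=float)
    q /= np.linalg.norm(q)
    T = np.stack([np.asarray(tgt_emb[w], dtype=float) for w in tgt_vocab])
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    return tgt_vocab[int(np.argmax(T @ q))]

# Toy shared-space vectors: 'dog'/'chien' and 'cat'/'chat' land nearby.
src_emb = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
tgt_emb = {"chien": [0.9, 0.2], "chat": [0.2, 0.9]}
vocab = ["chien", "chat"]
guess = translate_word("dog", src_emb, tgt_emb, vocab)
```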
2 code implementations • 13 Oct 2022 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
2 code implementations • 29 Jan 2024 • Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse.
2 code implementations • 22 Nov 2020 • Andreea-Maria Oncescu, João F. Henriques, Yang Liu, Andrew Zisserman, Samuel Albanie
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
1 code implementation • 3 Aug 2020 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".
1 code implementation • 3 Aug 2018 • Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman
We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips.
Ranked #61 on Action Classification on Kinetics-600
1 code implementation • 28 Dec 2023 • Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images.
1 code implementation • 2 Jun 2023 • Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, Andrew Zisserman
Our objective is open-world object counting in images, where the target object class is specified by a text description.
Ranked #1 on Zero-Shot Counting on FSC147
1 code implementation • 8 Oct 2020 • Liliane Momeni, Gül Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
1 code implementation • ICCV 2023 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D).
1 code implementation • ICML 2020 • Leonard Berrada, Andrew Zisserman, M. Pawan Kumar
In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously.
1 code implementation • 5 Jul 2022 • Junyu Xie, Weidi Xie, Andrew Zisserman
The objective of this paper is a model that is able to discover, track and segment multiple moving objects in a video.
Ranked #3 on Unsupervised Object Segmentation on FBMS-59
2 code implementations • 23 Oct 2018 • Yujie Zhong, Relja Arandjelović, Andrew Zisserman
The objective of this paper is to learn a compact representation of image sets for template-based face recognition.
Ranked #3 on Face Verification on IJB-A
2 code implementations • 29 Apr 2020 • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques.
1 code implementation • 1 Nov 2021 • Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.
1 code implementation • 20 Feb 2023 • Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022.
1 code implementation • 6 Jan 2021 • Manuel J. Marin-Jimenez, Vicky Kalogeiton, Pablo Medina-Suarez, Andrew Zisserman
For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos.
1 code implementation • 21 Aug 2023 • Ragav Sachdeva, Andrew Zisserman
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene acquired from different camera positions and at different temporal instances.
1 code implementation • 27 May 2019 • Relja Arandjelović, Andrew Zisserman
We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever.
1 code implementation • 14 Jul 2021 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
This paper explores the use of self-supervised deep learning in medical imaging in cases where two scan modalities are available for the same subject.
1 code implementation • 21 May 2019 • Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman
The first is a simple but effective one: we leverage the power of transfer learning among different tasks and self-supervision to initialize a good representation of the data without making use of any label.
1 code implementation • CVPR 2019 • Manuel J. Marin-Jimenez, Vicky Kalogeiton, Pablo Medina-Suarez, Andrew Zisserman
For this purpose, we propose LAEO-Net, a new deep CNN for determining LAEO in videos.
1 code implementation • 29 Oct 2021 • K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.
Ranked #1 on Visual Keyword Spotting on LRS2
1 code implementation • 14 May 2014 • Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost.
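The augmentation transforms in question (crops and horizontal flips, as popularised for CNN training) are easy to reproduce. A minimal sketch in plain Python, treating a grayscale image as a list of rows; the crop size and the specific variants are illustrative choices, not the paper's exact protocol:

```python
def augment(image):
    """Generate a horizontal flip and four corner crops of a 2-D image
    (a list of rows), the kind of augmentation commonly used for CNNs."""
    h, w = len(image), len(image[0])
    ch, cw = h - 1, w - 1          # crop one pixel off each dimension
    flipped = [row[::-1] for row in image]
    variants = [image, flipped]
    # four corner crops of the original image
    for top in (0, h - ch):
        for left in (0, w - cw):
            variants.append([row[left:left + cw]
                             for row in image[top:top + ch]])
    return variants

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
views = augment(img)   # original + flip + 4 crops = 6 views
```

The same six views can then be fed to a shallow encoder (e.g. a Fisher vector pipeline) exactly as they would be fed to a CNN, which is the paper's point.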
1 code implementation • CVPR 2019 • Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman
True video understanding requires making sense of non-Lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple media -- colored windows, dirty mirrors, smoke or rain.
1 code implementation • 16 Nov 2022 • K R Prajwal, Hannah Bull, Liliane Momeni, Samuel Albanie, Gül Varol, Andrew Zisserman
Through extensive evaluations, we verify our method for automatic annotation and our model architecture.
1 code implementation • 8 Oct 2023 • Sindhu B Hegde, Andrew Zisserman
In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
1 code implementation • ECCV 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman
We propose and investigate an identity sensitive joint embedding of face and voice.
2 code implementations • 22 Feb 2022 • Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle
This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video.
1 code implementation • 17 May 2022 • Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
Our goal in this paper is the adaptation of image-text models for long video retrieval.
Ranked #4 on Zero-Shot Action Recognition on Charades
1 code implementation • ICCV 2015 • Tomas Pfister, James Charles, Andrew Zisserman
The objective of this work is human pose estimation in videos, where multiple frames are available.
1 code implementation • 10 Oct 2023 • Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
(iii) We find that features from Stable Diffusion are good for discriminative learning of a number of properties, including scene geometry, support relations, shadows and depth, but less performant for occlusion and material.
1 code implementation • 24 Jul 2018 • Weidi Xie, Andrew Zisserman
In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, illumination), and "content" quality (relative importance for discriminative classification).
Ranked #5 on Face Verification on IJB-C (TAR @ FAR=1e-2 metric)
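The aggregation described above can be pictured as a quality-weighted average of per-image descriptors. A toy sketch, assuming the quality scores come from some predictor (in the paper they are learned; here they are just given as inputs):

```python
import math

def aggregate(features, qualities):
    """Combine per-image face descriptors into one set descriptor,
    weighting each by a softmax over its (hypothetical) quality score."""
    exps = [math.exp(q) for q in qualities]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]

faces = [[1.0, 0.0], [0.0, 1.0]]
agg = aggregate(faces, [0.0, 0.0])   # equal quality -> plain average
```

With skewed quality scores the aggregate is dominated by the high-quality image, which is the intended behaviour for blurry or badly lit faces.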
1 code implementation • 16 Sep 2020 • Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein
We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur.
no code implementations • 15 Jun 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.
no code implementations • ECCV 2018 • Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, Simon Osindero
We introduce a class of causal video understanding models that aim to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles.
no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.
no code implementations • CVPR 2018 • Arsha Nagrani, Samuel Albanie, Andrew Zisserman
We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.
no code implementations • 10 Mar 2018 • Simon Schmitt, Jonathan J. Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, S. M. Ali Eslami
Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance.
no code implementations • 31 Jan 2018 • Arsha Nagrani, Andrew Zisserman
The goal of this paper is the automatic identification of characters in TV and feature film material.
no code implementations • CVPR 2018 • Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman
In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video.
no code implementations • ECCV 2018 • Relja Arandjelović, Andrew Zisserman
We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.
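Once audio and visual embeddings live in one space, within-mode and between-mode retrieval are the same operation: rank the gallery by similarity to the query. A minimal sketch with made-up 2-D embeddings and cosine similarity:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query, gallery):
    """Return gallery indices ranked by cosine similarity to the query.
    The same routine serves audio-to-audio and audio-to-visual retrieval,
    provided both modalities are embedded in a shared space."""
    return sorted(range(len(gallery)),
                  key=lambda i: -cosine(query, gallery[i]))

audio_query  = [1.0, 0.1]                       # hypothetical embeddings
visual_items = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
ranking = retrieve(audio_query, visual_items)
```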
no code implementations • 20 Dec 2016 • David F. Fouhey, Abhinav Gupta, Andrew Zisserman
Our first objective is to infer these 3D shape attributes from a single image.
no code implementations • 21 Nov 2017 • Olivia Wiles, Andrew Zisserman
The objective of this paper is 3D shape understanding from single and multiple images.
no code implementations • ICCV 2017 • Carl Doersch, Andrew Zisserman
We investigate methods for combining multiple self-supervised tasks -- i.e., supervised tasks where data can be collected without manual labeling -- in order to train a single visual representation.
no code implementations • 10 May 2016 • Vasileios Belagiannis, Andrew Zisserman
We propose a novel ConvNet model for predicting 2D human body poses in an image.
no code implementations • 1 Aug 2017 • Amir Jamaludin, Timor Kadir, Andrew Zisserman
We show that the performance of the pre-trained CNN on the supervised classification task is (i) superior to that of a network trained from scratch; and (ii) requires far fewer annotated training samples to reach an equivalent performance to that of the network trained from scratch.
no code implementations • 3 Jul 2017 • Weilin Huang, Christopher P. Bridge, J. Alison Noble, Andrew Zisserman
We present an automatic method to describe clinically useful information about scanning, and to guide image interpretation in ultrasound (US) videos of the fetal heart.
1 code implementation • CVPR 2017 • Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
Ranked #4 on Lipreading on GRID corpus (mixed-speech) (using extra training data)
no code implementations • 24 Nov 2016 • Naila Murray, Hervé Jégou, Florent Perronnin, Andrew Zisserman
The second one involves equalising the match of a single descriptor to the aggregated vector.
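The "equalising the match" idea can be illustrated with a fixed-point iteration that rescales each local descriptor until its inner product with the aggregated vector is roughly the same for all descriptors, damping bursty (repeated) features. This is only an illustrative sketch, not the paper's exact algorithm:

```python
def democratic_aggregate(descs, iters=10):
    """Reweight local descriptors so each one's match (inner product)
    with the aggregated vector is approximately equal. Illustrative
    fixed-point iteration; the paper's procedure differs in detail."""
    n, dim = len(descs), len(descs[0])
    w = [1.0] * n
    for _ in range(iters):
        agg = [sum(w[i] * descs[i][d] for i in range(n)) for d in range(dim)]
        for i in range(n):
            m = w[i] * sum(descs[i][d] * agg[d] for d in range(dim))
            if m > 0:
                w[i] /= m ** 0.5      # shrink descriptors that match too well
    agg = [sum(w[i] * descs[i][d] for i in range(n)) for d in range(dim)]
    return w, agg

# two identical (bursty) descriptors and one unique descriptor
w, agg = democratic_aggregate([[1, 0], [1, 0], [0, 1]])
```

After convergence the repeated direction is down-weighted relative to the unique one, so no single visual pattern dominates the aggregate.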
no code implementations • 6 Aug 2016 • Joon Son Chung, Andrew Zisserman
The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.
no code implementations • CVPR 2016 • James Charles, Tomas Pfister, Derek Magee, David Hogg, Andrew Zisserman
The outcome is a substantial improvement in the pose estimates for the target video using the personalized ConvNet compared to the original generic ConvNet.
no code implementations • 12 Mar 2016 • Nate Crosswhite, Jeffrey Byrne, Omkar M. Parkhi, Chris Stauffer, Qiong Cao, Andrew Zisserman
Face recognition performance evaluation has traditionally focused on one-to-one verification, popularized by the Labeled Faces in the Wild dataset for imagery and the YouTubeFaces dataset for videos.
Ranked #8 on Face Verification on IJB-A
no code implementations • 20 Dec 2014 • Sobhan Naderi Parizi, Andrea Vedaldi, Andrew Zisserman, Pedro Felzenszwalb
First, a collection of informative parts is discovered, using heuristics that promote part distinctiveness and diversity, and then classifiers are trained on the vector of part responses.
no code implementations • 18 Dec 2014 • Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
We develop a representation suitable for the unconstrained recognition of words in natural images: the general case of no fixed lexicon and unknown length.
no code implementations • 4 Dec 2014 • Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
In this work we present an end-to-end system for text spotting -- localising and recognising text in natural scene images -- and text based image retrieval.
Ranked #15 on Scene Text Detection on ICDAR 2013
no code implementations • 17 Jul 2014 • Ken Chatfield, Karen Simonyan, Andrew Zisserman
We investigate the gains in precision and speed that can be obtained by using Convolutional Networks (ConvNets) for on-the-fly retrieval -- where classifiers are learnt at run time for a textual query from downloaded images, and used to rank large image or video datasets.
no code implementations • 15 May 2014 • Max Jaderberg, Andrea Vedaldi, Andrew Zisserman
The focus of this paper is speeding up the evaluation of convolutional neural networks.
no code implementations • 21 Jul 2018 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition.
no code implementations • 26 Jul 2018 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce a simple baseline for action localization on the AVA dataset.
Ranked #12 on Action Recognition on AVA v2.1
no code implementations • 27 Jul 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio).
no code implementations • ECCV 2018 • Weidi Xie, Li Shen, Andrew Zisserman
Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair -- this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models.
no code implementations • 16 Aug 2018 • Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.
Ranked #3 on Facial Expression Recognition (FER) on FERPlus
no code implementations • 3 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition.
no code implementations • 6 Sep 2018 • Olivia Wiles, Andrew Zisserman
Finally, we demonstrate that we can indeed obtain a depth map of a novel object from a single image for a variety of sculptures with varying shape/texture, and that the network generalises at test time to new domains (e.g. synthetic images).
no code implementations • 6 Sep 2018 • Mohsan Alvi, Andrew Zisserman, Christoffer Nellaker
We demonstrate on this dataset, for a number of facial attribute classification tasks, that we are able to remove racial biases from the network feature representation.
no code implementations • 17 Sep 2018 • Mitchell Dawson, Andrew Zisserman, Christoffer Nellåker
In the instance of data sets for visual kinship verification, one such unintended signal could be that the faces are cropped from the same photograph, since faces from the same photograph are more likely to be from the same family.
no code implementations • 23 Sep 2018 • Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
This work presents a method for visual text recognition without using any paired supervisory data.
no code implementations • CVPR 2019 • Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman
We introduce the Action Transformer model for recognizing and localizing human actions in video clips.
Ranked #6 on Action Recognition on AVA v2.1
no code implementations • NeurIPS 2013 • Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
As massively parallel computations have become broadly available with modern GPUs, deep architectures trained on very large datasets have risen in popularity.
no code implementations • NeurIPS 2011 • Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman
Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels.
no code implementations • NeurIPS 2010 • Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images.
no code implementations • NeurIPS 2010 • Victor Lempitsky, Andrew Zisserman
Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function.
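The flavour of this formulation can be shown with a heavily simplified stand-in: fit a linear density predictor by ridge regression (a regularized quadratic cost), then count by summing the predicted density over all pixels. The features, labels, and closed-form 2-D solve below are illustrative only; the paper's actual risk differs:

```python
def ridge_fit(X, y, lam=0.1):
    """Fit w minimising ||Xw - y||^2 + lam * ||w||^2 for 2-D features,
    via the normal equations (closed-form 2x2 solve)."""
    a11 = sum(x[0] * x[0] for x in X) + lam
    a12 = sum(x[0] * x[1] for x in X)
    a22 = sum(x[1] * x[1] for x in X) + lam
    b1 = sum(x[0] * t for x, t in zip(X, y))
    b2 = sum(x[1] * t for x, t in zip(X, y))
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

def count(X, w):
    """Estimated object count = sum of the predicted per-pixel density."""
    return sum(w[0] * x[0] + w[1] * x[1] for x in X)

# toy per-pixel features: [1, 0] = object pixel, [0, 1] = background
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 0, 1, 0]                      # ground-truth density values
w = ridge_fit(X, y)
estimated = count(X, w)               # close to the true count of 2
```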
no code implementations • NeurIPS 2009 • Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes.
no code implementations • NeurIPS 2009 • Andrea Vedaldi, Andrew Zisserman
We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference.
no code implementations • NeurIPS 2008 • Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, Francis R. Bach
It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data.
no code implementations • CVPR 2018 • Donglai Wei, Joseph J. Lim, Andrew Zisserman, William T. Freeman
We seek to understand the arrow of time in videos -- what makes videos look like they are playing forwards or backwards?
Ranked #50 on Self-Supervised Action Recognition on UCF101
no code implementations • ECCV 2018 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio).
no code implementations • CVPR 2013 • Carlos Arteta, Victor Lempitsky, J. A. Noble, Andrew Zisserman
For example, our detector can pick a region containing two or three object instances, while assigning such a region an appropriate label.
no code implementations • CVPR 2013 • Relja Arandjelovic, Andrew Zisserman
The objective of this paper is large scale object instance retrieval, given a query image.
no code implementations • CVPR 2013 • Lubor Ladicky, Philip H. S. Torr, Andrew Zisserman
Our goal is to detect humans and estimate their 2D pose in single images.
no code implementations • CVPR 2013 • Mayank Juneja, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously to learn the part appearance and also to identify the part occurrences in images.
no code implementations • CVPR 2013 • Minh Hoai, Andrew Zisserman
The objective of this work is to learn sub-categories.
no code implementations • CVPR 2014 • Minh Hoai, Andrew Zisserman
The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material.
no code implementations • CVPR 2014 • Omkar M. Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Our goal is to learn a compact, discriminative vector representation of a face track, suitable for the face recognition tasks of verification and classification.
no code implementations • CVPR 2014 • Lyndsey C. Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Chang-Shui Zhang, Andrew Zisserman, Bernhard Scholkopf, William T. Freeman
We explore whether we can observe Time's Arrow in a temporal sequence -- is it possible to tell whether a video is running forwards or backwards?
no code implementations • CVPR 2014 • Yusuf Aytar, Andrew Zisserman
The objective of this work is object category detection in large scale image datasets in the manner of Video Google: an object category is specified by a HOG classifier template, and retrieval is immediate at run time.
no code implementations • CVPR 2014 • Herve Jegou, Andrew Zisserman
We consider the design of a single vector representation for an image that embeds and aggregates a set of local patch descriptors such as SIFT.
no code implementations • CVPR 2016 • David F. Fouhey, Abhinav Gupta, Andrew Zisserman
In this paper we investigate 3D attributes as a means to understand the shape of an object in a single image.
no code implementations • NeurIPS 2019 • Carl Doersch, Andrew Zisserman
In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints.
no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice.
no code implementations • 15 Jul 2019 • Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman
We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos.
no code implementations • 14 Aug 2019 • Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
We propose AutoCorrect, a method to automatically learn object-annotation alignments from a dataset with annotations affected by geometric noise.
no code implementations • 6 Sep 2019 • Dan Xu, Weidi Xie, Andrew Zisserman
In this paper we propose a geometry-aware model for video object detection.
no code implementations • 19 Sep 2019 • Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman
The goal of this paper is to label all the animal individuals present in every frame of a video.
no code implementations • ICCV 2019 • Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman
The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.
no code implementations • 28 Oct 2019 • Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information.
no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.
Ranked #13 on Lipreading on LRS3-TED (using extra training data)
no code implementations • 5 Dec 2019 • Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman
The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.
no code implementations • 20 Feb 2020 • Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
The objective of this paper is to learn representations of speaker identity without access to manually annotated data.
no code implementations • 26 Mar 2020 • Yujie Zhong, Relja Arandjelović, Andrew Zisserman
The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors.
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
no code implementations • 13 Apr 2020 • Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi
Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision.
no code implementations • 1 May 2020 • Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips.
no code implementations • CVPR 2020 • Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
We present an approach for estimating the period with which an action is repeated in a video.
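A crude stand-in for period estimation, to give the flavour of the task: pick the lag that maximises the autocorrelation of a 1-D per-frame signal (for instance, a frame embedding projected to one value). The paper's method is learned; this is only a toy baseline:

```python
def estimate_period(signal, max_period=None):
    """Estimate the repetition period of a 1-D signal by choosing the
    lag with the highest average lagged product (autocorrelation)."""
    n = len(signal)
    max_period = max_period or n // 2
    best_lag, best_score = 1, float('-inf')
    for lag in range(1, max_period + 1):
        score = sum(signal[i] * signal[i + lag]
                    for i in range(n - lag)) / (n - lag)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag

period = estimate_period([0, 1, 0, -1] * 5)   # signal repeats every 4 frames
```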
no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
no code implementations • 6 Jul 2020 • Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman
We propose a novel convolutional method for the detection and identification of vertebrae in whole spine MRIs.
no code implementations • CVPR 2021 • Olivia Wiles, Sebastien Ehrhardt, Andrew Zisserman
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
no code implementations • 1 Sep 2020 • Weidi Xie, Jeffrey Byrne, Andrew Zisserman
We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high quality (unblurred, good lighting, more frontal) faces from a collection, e.g. for automatic enrolment or display.
no code implementations • ECCV 2020 • Yang Liu, Qingchao Chen, Andrew Zisserman
In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object.
no code implementations • ECCV 2020 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents.
no code implementations • 21 Oct 2020 • Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset.
no code implementations • 23 Nov 2020 • Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman
We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consecutive frames based on the background, which effectively emphasises the object boundary in the difference image, and a motion segmentation module with memory that discovers the moving objects, while maintaining the object permanence even when motion is absent at some point.
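The registration-then-difference idea in (i) can be caricatured in a few lines: once consecutive frames are aligned, a moving object stands out in the absolute difference image even when its appearance matches the background. A toy sketch (real alignment is a learned, differentiable module; here the frames are assumed already registered):

```python
def difference_mask(prev_frame, cur_frame, thresh=0.5):
    """Binary mask of pixels that changed between two aligned frames.
    Moving, even camouflaged, objects light up in the difference image."""
    return [[1 if abs(a - b) > thresh else 0
             for a, b in zip(r1, r2)]
            for r1, r2 in zip(prev_frame, cur_frame)]

frame_t0 = [[0, 0, 0],
            [0, 5, 0],
            [0, 0, 0]]
frame_t1 = [[0, 0, 0],
            [0, 0, 5],      # camouflaged object moved one pixel right
            [0, 0, 0]]
mask = difference_mask(frame_t0, frame_t1)
```

The mask highlights both the vacated and newly occupied pixels, which is why the difference image emphasises the object boundary.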
no code implementations • 12 Dec 2020 • Arsha Nagrani, Joon Son Chung, Jaesung Huh, Andrew Brown, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman
We held the second installment of the VoxCeleb Speaker Recognition Challenge in conjunction with Interspeech 2020.
no code implementations • 10 Feb 2021 • Andrew Brown, Ernesto Coto, Andrew Zisserman
We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio).
no code implementations • 1 Dec 2008 • Maria-Elena Nilsback, Andrew Zisserman
We investigate to what extent combinations of features can improve classification performance on a large dataset of similar classes.
no code implementations • CVPR 2021 • Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
We also extend our method to the video domain, improving the state of the art on the VATEX dataset.
no code implementations • CVPR 2021 • Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.
no code implementations • ICCV 2021 • Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, Weidi Xie
We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.
Ranked #7 on Unsupervised Object Segmentation on DAVIS 2016
no code implementations • CVPR 2021 • Chuhan Zhang, Ankush Gupta, Andrew Zisserman
It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query.
no code implementations • ICCV 2021 • Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, Andrew Zisserman
The goal of this work is to temporally align asynchronous subtitles in sign language videos.
no code implementations • CVPR 2021 • Erika Lu, Forrester Cole, Tali Dekel, Andrew Zisserman, William T. Freeman, Michael Rubinstein
We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.
no code implementations • 20 May 2021 • Andrew Brown, Vicky Kalogeiton, Andrew Zisserman
In this paper we make contributions to address both these deficiencies: first, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice).