Search Results for author: Andrew Zisserman

Found 187 papers, 79 papers with code

Amplifying Key Cues for Human-Object-Interaction Detection

no code implementations ECCV 2020 Yang Liu, Qingchao Chen, Andrew Zisserman

In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object.

Human-Object Interaction Detection

A CLIP-Hitchhiker's Guide to Long Video Retrieval

no code implementations 17 May 2022 Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our goal in this paper is the adaptation of image-text models for long video retrieval.

Frame Video Retrieval

Scaling up sign spotting through sign language dictionaries

no code implementations 9 May 2022 Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.

Multiple Instance Learning

SpineNetV2: Automated Detection, Labelling and Radiological Grading Of Clinical MR Scans

no code implementations 3 May 2022 Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

This technical report presents SpineNetV2, an automated tool which: (i) detects and labels vertebral bodies in clinical spinal magnetic resonance (MR) scans across a range of commonly used sequences; and (ii) performs radiological grading of lumbar intervertebral discs in T2-weighted scans for a range of common degenerative changes.

Temporal Alignment Networks for Long-term Video

no code implementations 6 Apr 2022 Tengda Han, Weidi Xie, Andrew Zisserman

The objective of this paper is a temporal alignment network that ingests long term video sequences, and associated text sentences, in order to: (1) determine if a sentence is alignable with the video; and (2) if it is alignable, then determine its alignment.

Action Recognition Action Segmentation +2

Hierarchical Perceiver

no code implementations 22 Feb 2022 Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs.

Persistent Object Identification Leveraging Non-Visual Markers

1 code implementation 13 Dec 2021 Michael P. J. Camilleri, Li Zhang, Rasneer S. Bains, Andrew Zisserman, Christopher K. I. Williams

Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research.

Visual Tracking

Label, Verify, Correct: A Simple Few Shot Object Detection Method

no code implementations 10 Dec 2021 Prannay Kaul, Weidi Xie, Andrew Zisserman

The objective of this paper is few-shot object detection (FSOD) -- the task of expanding an object detector for a new category given only a few instances for training.

Few-Shot Object Detection

Audio-Visual Synchronisation in the wild

no code implementations 8 Dec 2021 Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.

Lip Reading

Input-level Inductive Biases for 3D Reconstruction

no code implementations 6 Dec 2021 Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman

Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.

3D Reconstruction Depth Estimation

It's About Time: Analog Clock Reading in the Wild

no code implementations 17 Nov 2021 Charig Yang, Weidi Xie, Andrew Zisserman

In this paper, we present a framework for reading analog clocks in natural images or videos.

BBC-Oxford British Sign Language Dataset

no code implementations 5 Nov 2021 Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, Andrew Zisserman

In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL).

Sign Language Translation Translation

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

1 code implementation 1 Nov 2021 Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen

We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance.

Action Recognition Language Modelling

Visual Keyword Spotting with Attention

1 code implementation 29 Oct 2021 K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.

Lip Reading Visual Keyword Spotting

Sub-word Level Lip Reading With Visual Attention

no code implementations 14 Oct 2021 K R Prajwal, Triantafyllos Afouras, Andrew Zisserman

To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network.

Ranked #1 on Lipreading on LRS2 (using extra training data)

Audio-Visual Active Speaker Detection Automatic Speech Recognition +3

Open-Set Recognition: a Good Closed-Set Classifier is All You Need?

1 code implementation ICLR 2022 Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman

In this paper, we first demonstrate that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.

Open Set Learning Out-of-Distribution Detection

PASS: An ImageNet replacement for self-supervised pretraining without humans

1 code implementation NeurIPS Workshop ImageNet_PPF 2021 Yuki M. Asano, Christian Rupprecht, Andrew Zisserman, Andrea Vedaldi

On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining.

Pose Estimation Transfer Learning

Self-Supervised Multi-Modal Alignment for Whole Body Medical Imaging

1 code implementation 14 Jul 2021 Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

This paper explores the use of self-supervised deep learning in medical imaging in cases where two scan modalities are available for the same subject.

AutoNovel: Automatically Discovering and Learning Novel Visual Categories

no code implementations 29 Jun 2021 Kai Han, Sylvestre-Alvise Rebuffi, Sébastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman

We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.

Image Clustering Self-Supervised Learning

NeRF in detail: Learning to sample for view synthesis

no code implementations 9 Jun 2021 Relja Arandjelović, Andrew Zisserman

In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it is based on a heuristic and not trained end-to-end for the task at hand.

Novel View Synthesis

Face, Body, Voice: Video Person-Clustering with Multiple Modalities

no code implementations 20 May 2021 Andrew Brown, Vicky Kalogeiton, Andrew Zisserman

In this paper we make contributions to address both these deficiencies: first, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice).

Face Clustering

Comment on Stochastic Polyak Step-Size: Performance of ALI-G

no code implementations 20 May 2021 Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

This is a short note on the performance of the ALI-G algorithm (Berrada et al., 2020) as reported in (Loizou et al., 2021).

Omnimatte: Associating Objects and Their Effects in Video

no code implementations CVPR 2021 Erika Lu, Forrester Cole, Tali Dekel, Andrew Zisserman, William T. Freeman, Michael Rubinstein

We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.

Temporal Query Networks for Fine-grained Video Understanding

no code implementations CVPR 2021 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query.

Action Classification Video Understanding

TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval

1 code implementation ICCV 2021 Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Samuel Albanie, Yang Liu

In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders.

Video Retrieval

Self-supervised Video Object Segmentation by Motion Grouping

no code implementations ICCV 2021 Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, Weidi Xie

We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.

Motion Segmentation Optical Flow Estimation +4

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

4 code implementations ICCV 2021 Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

Ranked #6 on Video Retrieval on DiDeMo (using extra training data)

Text to Video Retrieval Video Captioning +1

Read and Attend: Temporal Localisation in Sign Language Videos

no code implementations CVPR 2021 Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.

Sign Language Recognition

Slow-Fast Auditory Streams For Audio Recognition

2 code implementations 5 Mar 2021 Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs.

Audio Classification

Perceiver: General Perception with Iterative Attention

9 code implementations 4 Mar 2021 Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.

3D Point Cloud Classification Audio Classification +1

Automated Video Labelling: Identifying Faces by Corroborative Evidence

no code implementations 10 Feb 2021 Andrew Brown, Ernesto Coto, Andrew Zisserman

We present a method for automatically labelling all faces in video archives, such as TV broadcasts, by combining multiple evidence sources and multiple modalities (visual and audio).

Domain Adaptation Image Retrieval

Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation

no code implementations 23 Nov 2020 Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman

We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consecutive frames based on the background, which effectively emphasises the object boundary in the difference image, and a motion segmentation module with memory that discovers the moving objects, while maintaining the object permanence even when motion is absent at some point.

Motion Segmentation Object Discovery +1

A Short Note on the Kinetics-700-2020 Human Action Dataset

no code implementations 21 Oct 2020 Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman

We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset.

Watch, read and lookup: learning to spot signs from multiple supervisors

1 code implementation 8 Oct 2020 Liliane Momeni, Gül Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.

Multiple Instance Learning

Layered Neural Rendering for Retiming People in Video

1 code implementation 16 Sep 2020 Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein

We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur.

Frame Neural Rendering

Adaptive Text Recognition through Visual Matching

no code implementations ECCV 2020 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents.

Representation Learning

Seeing wake words: Audio-visual Keyword Spotting

1 code implementation 2 Sep 2020 Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.

Lip Reading Visual Keyword Spotting

Inducing Predictive Uncertainty Estimation for Face Recognition

no code implementations 1 Sep 2020 Weidi Xie, Jeffrey Byrne, Andrew Zisserman

We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high quality (unblurred, good lighting, more frontal) faces from a collection, e.g. for automatic enrolment or display.

Face Recognition Face Verification

RareAct: A video dataset of unusual interactions

1 code implementation 3 Aug 2020 Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes".

Action Recognition

Memory-augmented Dense Predictive Coding for Video Representation Learning

1 code implementation ECCV 2020 Tengda Han, Weidi Xie, Andrew Zisserman

The objective of this paper is self-supervised learning from video, in particular for representations for action recognition.

Action Classification Action Recognition +4

Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval

2 code implementations ECCV 2020 Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman

Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods.

Image Instance Retrieval Metric Learning +1

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

1 code implementation ECCV 2020 Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman

Recent progress in fine-grained gesture and action classification, and machine translation, points to the possibility of automated sign language recognition becoming a reality.

Action Classification Keyword Spotting +2

CrossTransformers: spatially-aware few-shot transfer

2 code implementations NeurIPS 2020 Carl Doersch, Ankush Gupta, Andrew Zisserman

In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains.

Self-Supervised Learning

Co-Attention for Conditioned Image Matching

no code implementations CVPR 2021 Olivia Wiles, Sebastien Ehrhardt, Andrew Zisserman

We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.

3D Reconstruction Camera Localization +2

Spot the conversation: speaker diarisation in the wild

no code implementations 2 Jul 2020 Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.

Speaker Verification

Self-Supervised MultiModal Versatile Networks

1 code implementation NeurIPS 2020 Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.

Action Recognition In Videos Audio Classification +2

The AVA-Kinetics Localized Human Actions Video Dataset

no code implementations 1 May 2020 Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman

The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips.

Action Classification

VGGSound: A Large-scale Audio-Visual Dataset

1 code implementation 29 Apr 2020 Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques.

Image Classification

Monocular Depth Estimation with Self-supervised Instance Adaptation

no code implementations 13 Apr 2020 Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi

Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision.

Monocular Depth Estimation Self-Supervised Learning

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Compact Deep Aggregation for Set Retrieval

no code implementations 26 Mar 2020 Yujie Zhong, Relja Arandjelović, Andrew Zisserman

The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors.

Visual Grounding in Video for Unsupervised Word Translation

1 code implementation CVPR 2020 Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages.

Translation Visual Grounding +1

Disentangled Speech Embeddings using Cross-modal Self-supervision

no code implementations 20 Feb 2020 Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

The objective of this paper is to learn representations of speaker identity without access to manually annotated data.

Self-Supervised Learning Speaker Recognition

Automatically Discovering and Learning New Visual Categories with Ranking Statistics

1 code implementation ICLR 2020 Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman

In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data.

General Classification Self-Supervised Learning

Synthetic Humans for Action Recognition from Unseen Viewpoints

1 code implementation 9 Dec 2019 Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored.

Action Classification Action Recognition +1

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

no code implementations 5 Dec 2019 Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A. Reynolds, Andrew Zisserman

The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data.

Speaker Recognition

ASR is all you need: cross-modal distillation for lip reading

no code implementations 28 Nov 2019 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.

Ranked #10 on Lipreading on LRS2 (using extra training data)

Automatic Speech Recognition Frame +3

Self-supervised learning of class embeddings from video

no code implementations 28 Oct 2019 Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information.

Frame Self-Supervised Learning

Controllable Attention for Structured Layered Video Decomposition

no code implementations ICCV 2019 Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman

The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to.

Action Recognition Reflection Removal

Count, Crop and Recognise: Fine-Grained Recognition in the Wild

no code implementations 19 Sep 2019 Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman

The goal of this paper is to label all the animal individuals present in every frame of a video.

Frame

Video Representation Learning by Dense Predictive Coding

1 code implementation 10 Sep 2019 Tengda Han, Weidi Xie, Andrew Zisserman

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition.

Representation Learning Self-Supervised Action Recognition +1

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

1 code implementation ICCV 2019 Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets.

Action Recognition Egocentric Activity Recognition

AutoCorrect: Deep Inductive Alignment of Noisy Geometric Annotations

no code implementations 14 Aug 2019 Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

We propose AutoCorrect, a method to automatically learn object-annotation alignments from a dataset with annotations affected by geometric noise.

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

3 code implementations 31 Jul 2019 Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.

Video Retrieval

A Short Note on the Kinetics-700 Human Action Dataset

no code implementations 15 Jul 2019 Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman

We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos.

Action Classification

My lips are concealed: Audio-visual speech enhancement through obstructions

no code implementations 11 Jul 2019 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice.

Speech Enhancement

Sim2real transfer learning for 3D human pose estimation: motion to the rescue

no code implementations NeurIPS 2019 Carl Doersch, Andrew Zisserman

In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints.

3D Human Pose Estimation 3D Pose Estimation +2

Training Neural Networks for and by Interpolation

1 code implementation ICML 2020 Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously.

A Hierarchical Probabilistic U-Net for Modeling Multi-Scale Ambiguities

4 code implementations 30 May 2019 Simon A. A. Kohl, Bernardino Romera-Paredes, Klaus H. Maier-Hein, Danilo Jimenez Rezende, S. M. Ali Eslami, Pushmeet Kohli, Andrew Zisserman, Olaf Ronneberger

Medical imaging only indirectly measures the molecular identity of the tissue within each voxel, which often produces only ambiguous image evidence for target measures of interest, like semantic segmentation.

Instance Segmentation Medical Image Segmentation +1

Object Discovery with a Copy-Pasting GAN

1 code implementation 27 May 2019 Relja Arandjelović, Andrew Zisserman

We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever.

Object Discovery Unsupervised Object Segmentation

Semi-Supervised Learning with Scarce Annotations

1 code implementation 21 May 2019 Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman

The first is a simple but effective one: we leverage the power of transfer learning among different tasks and self-supervision to initialize a good representation of the data without making use of any label.

Multi-class Classification Self-Supervised Learning +1

A Geometric Approach to Obtain a Bird's Eye View from an Image

1 code implementation 6 May 2019 Ammar Abbas, Andrew Zisserman

The objective of this paper is to rectify any monocular image by computing a homography matrix that transforms it to a bird's eye (overhead) view.

The VIA Annotation Software for Images, Audio and Video

1 code implementation 24 Apr 2019 Abhishek Dutta, Andrew Zisserman

In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA).

Temporal Cycle-Consistency Learning

3 code implementations CVPR 2019 Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

We introduce a self-supervised representation learning method based on the task of temporal alignment between videos.

Anomaly Detection Frame +3

The StreetLearn Environment and Dataset

1 code implementation 4 Mar 2019 Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

These datasets cannot be used for decision-making and reinforcement learning, however, and in general the perspective of navigation as an interactive learning task, where the actions and behaviours of a learning agent are learned simultaneously with the perception and planning, is relatively unsupported.

Decision Making

Utterance-level Aggregation For Speaker Recognition In The Wild

8 code implementations 26 Feb 2019 Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals.

Frame Speaker Recognition +1

The Visual Centrifuge: Model-Free Layered Video Representations

1 code implementation CVPR 2019 Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman

True video understanding requires making sense of non-lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain.

Color Constancy Video Understanding

Deep Frank-Wolfe For Neural Network Optimization

1 code implementation ICLR 2019 Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while converging faster.

Class-Agnostic Counting

1 code implementation 1 Nov 2018 Erika Lu, Weidi Xie, Andrew Zisserman

The model achieves competitive performance on cell and crowd counting datasets, and surpasses the state-of-the-art on the car dataset using only three training images.

Crowd Counting Few-Shot Learning +1

GhostVLAD for set-based face recognition

3 code implementations 23 Oct 2018 Yujie Zhong, Relja Arandjelović, Andrew Zisserman

The objective of this paper is to learn a compact representation of image sets for template-based face recognition.

Face Recognition Face Verification

Learning to Read by Spelling: Towards Unsupervised Text Recognition

no code implementations 23 Sep 2018 Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

This work presents a method for visual text recognition without using any paired supervisory data.

From Same Photo: Cheating on Visual Kinship Challenges

no code implementations 17 Sep 2018 Mitchell Dawson, Andrew Zisserman, Christoffer Nellåker

In the instance of data sets for visual kinship verification, one such unintended signal could be that the faces are cropped from the same photograph, since faces from the same photograph are more likely to be from the same family.

3D Surface Reconstruction by Pointillism

no code implementations 6 Sep 2018 Olivia Wiles, Andrew Zisserman

Finally, we demonstrate that we can indeed obtain a depth map of a novel object from a single image for a variety of sculptures with varying shape/texture, and that the network generalises at test time to new domains (e.g. synthetic images).

Surface Reconstruction

Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network Embeddings

no code implementations 6 Sep 2018 Mohsan Alvi, Andrew Zisserman, Christoffer Nellaker

We demonstrate on this dataset, for a number of facial attribute classification tasks, that we are able to remove racial biases from the network feature representation.

Classification Facial Attribute Classification +2

Self-supervised learning of a facial attribute embedding from video

2 code implementations 21 Aug 2018 Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time.

Frame Self-Supervised Learning +1

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

no code implementations 16 Aug 2018 Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.

Ranked #3 on Facial Expression Recognition on FERPlus (using extra training data)

Facial Emotion Recognition Facial Expression Recognition +1

A Short Note about Kinetics-600

1 code implementation 3 Aug 2018 Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips.

Action Classification

Comparator Networks

no code implementations ECCV 2018 Weidi Xie, Li Shen, Andrew Zisserman

Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair--this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models.

Face Recognition Image Classification +1

X2Face: A network for controlling face generation by using images, audio, and pose codes

no code implementations 27 Jul 2018 Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio).

Face Generation Frame

Multicolumn Networks for Face Recognition

1 code implementation 24 Jul 2018 Weidi Xie, Andrew Zisserman

In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, illumination), and "content" quality (relative importance for discriminative classification).

Ranked #5 on Face Verification on IJB-C (TAR @ FAR=1e-2 metric)

Face Recognition General Classification

Inductive Visual Localisation: Factorised Training for Superior Generalisation

no code implementations 21 Jul 2018 Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition.

Image Captioning Machine Translation +1

Deep Lip Reading: a comparison of models and an online application

no code implementations 15 Jun 2018 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.

Lip Reading Visual Speech Recognition

VoxCeleb2: Deep Speaker Recognition

2 code implementations 14 Jun 2018 Joon Son Chung, Arsha Nagrani, Andrew Zisserman

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

Ranked #1 on Speaker Verification on VoxCeleb2 (using extra training data)

Speaker Recognition Speaker Verification

Massively Parallel Video Networks

no code implementations ECCV 2018 Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, Simon Osindero

We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles.

Action Recognition Frame +1

The Conversation: Deep Audio-Visual Speech Enhancement

no code implementations 11 Apr 2018 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.

Speech Enhancement

Seeing Voices and Hearing Faces: Cross-modal biometric matching

no code implementations CVPR 2018 Arsha Nagrani, Samuel Albanie, Andrew Zisserman

We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task.

Face Recognition Speaker Identification

Learning to Navigate in Cities Without a Map

3 code implementations NeurIPS 2018 Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

We present an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage, and demonstrate that our learning method allows agents to learn to navigate multiple cities and to traverse to target destinations that may be kilometres away.

Autonomous Navigation reinforcement-learning

Kickstarting Deep Reinforcement Learning

no code implementations 10 Mar 2018 Simon Schmitt, Jonathan J. Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, S. M. Ali Eslami

Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance.

reinforcement-learning

Smooth Loss Functions for Deep Top-k Classification

1 code implementation ICLR 2018 Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

We compare the performance of the cross-entropy loss and our margin-based losses in various regimes of noise and data size, for the predominant use case of k=5.

Classification General Classification

What have we learned from deep representations for action recognition?

no code implementations CVPR 2018 Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman

In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video.

Action Recognition

Objects that Sound

no code implementations ECCV 2018 Relja Arandjelović, Andrew Zisserman

We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.

Cross-Modal Retrieval Frame +1

SilNet : Single- and Multi-View Reconstruction by Learning from Silhouettes

no code implementations 21 Nov 2017 Olivia Wiles, Andrew Zisserman

The objective of this paper is 3D shape understanding from single and multiple images.

VGGFace2: A dataset for recognising faces across pose and age

17 code implementations 23 Oct 2017 Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, Andrew Zisserman

The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.

Ranked #1 on Face Verification on IJB-C (dataset metric)

Face Recognition Face Verification +1

Detect to Track and Track to Detect

3 code implementations ICCV 2017 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year.

Frame Object Detection

Multi-task Self-Supervised Visual Learning

no code implementations ICCV 2017 Carl Doersch, Andrew Zisserman

We investigate methods for combining multiple self-supervised tasks--i.e., supervised tasks where data can be collected without manual labeling--in order to train a single visual representation.

Depth Estimation General Classification

Self-Supervised Learning for Spinal MRIs

no code implementations 1 Aug 2017 Amir Jamaludin, Timor Kadir, Andrew Zisserman

We show that the pre-trained CNN (i) outperforms a network trained from scratch on the supervised classification task; and (ii) requires far fewer annotated training samples to reach an equivalent performance to that of the network trained from scratch.

Classification General Classification +1

Temporal HeartNet: Towards Human-Level Automatic Analysis of Fetal Cardiac Screening Video

no code implementations 3 Jul 2017 Weilin Huang, Christopher P. Bridge, J. Alison Noble, Andrew Zisserman

We present an automatic method to describe clinically useful information about scanning, and to guide image interpretation in ultrasound (US) videos of the fetal heart.

Frame

VoxCeleb: a large-scale speaker identification dataset

8 code implementations Interspeech 2018 Arsha Nagrani, Joon Son Chung, Andrew Zisserman

Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance.

Sound

Look, Listen and Learn

1 code implementation ICCV 2017 Relja Arandjelović, Andrew Zisserman

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos?

Audio Classification General Classification

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

24 code implementations CVPR 2017 Joao Carreira, Andrew Zisserman

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.

Action Recognition Classification +2

You said that?

1 code implementation 8 May 2017 Joon Son Chung, Amir Jamaludin, Andrew Zisserman

To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames.

Unconstrained Lip-synchronization

From Images to 3D Shape Attributes

no code implementations 20 Dec 2016 David F. Fouhey, Abhinav Gupta, Andrew Zisserman

Our first objective is to infer these 3D shape attributes from a single image.

Interferences in match kernels

no code implementations 24 Nov 2016 Naila Murray, Hervé Jégou, Florent Perronnin, Andrew Zisserman

The second one involves equalising the match of a single descriptor to the aggregated vector.

Image Retrieval

Lip Reading Sentences in the Wild

no code implementations CVPR 2017 Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

Ranked #3 on Lipreading on GRID corpus (mixed-speech) (using extra training data)

Lipreading Lip Reading +1

Trusting SVM for Piecewise Linear CNNs

2 code implementations 7 Nov 2016 Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

We present a novel layerwise optimization algorithm for the learning objective of Piecewise-Linear Convolutional Neural Networks (PL-CNNs), a large class of convolutional neural networks.

Signs in time: Encoding human motion as a temporal image

no code implementations 6 Aug 2016 Joon Son Chung, Andrew Zisserman

The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training.

Time Series

3D Shape Attributes

no code implementations CVPR 2016 David F. Fouhey, Abhinav Gupta, Andrew Zisserman

In this paper we investigate 3D attributes as a means to understand the shape of an object in a single image.

Recurrent Human Pose Estimation

no code implementations 10 May 2016 Vasileios Belagiannis, Andrew Zisserman

We propose a novel ConvNet model for predicting 2D human body poses in an image.

Pose Estimation

Convolutional Two-Stream Network Fusion for Video Action Recognition

1 code implementation CVPR 2016 Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information.

Ranked #55 on Action Recognition on UCF101 (using extra training data)

Action Recognition Action Recognition In Videos +1

Template Adaptation for Face Verification and Identification

no code implementations 12 Mar 2016 Nate Crosswhite, Jeffrey Byrne, Omkar M. Parkhi, Chris Stauffer, Qiong Cao, Andrew Zisserman

Face recognition performance evaluation has traditionally focused on one-to-one verification, popularized by the Labeled Faces in the Wild dataset for imagery and the YouTubeFaces dataset for videos.

Face Identification Face Recognition +3

Personalizing Human Video Pose Estimation

no code implementations CVPR 2016 James Charles, Tomas Pfister, Derek Magee, David Hogg, Andrew Zisserman

The outcome is a substantial improvement in the pose estimates for the target video using the personalized ConvNet compared to the original generic ConvNet.

Optical Flow Estimation Pose Estimation

Spatial Transformer Networks

44 code implementations NeurIPS 2015 Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner.

Translation

Automatic Discovery and Optimization of Parts for Image Classification

no code implementations 20 Dec 2014 Sobhan Naderi Parizi, Andrea Vedaldi, Andrew Zisserman, Pedro Felzenszwalb

First, a collection of informative parts is discovered, using heuristics that promote part distinctiveness and diversity, and then classifiers are trained on the vector of part responses.

Classification General Classification +2

Deep Structured Output Learning for Unconstrained Text Recognition

no code implementations 18 Dec 2014 Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

We develop a representation suitable for the unconstrained recognition of words in natural images: the general case of no fixed lexicon and unknown length.

Language Modelling Multi-Task Learning

Reading Text in the Wild with Convolutional Neural Networks

no code implementations 4 Dec 2014 Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

In this work we present an end-to-end system for text spotting -- localising and recognising text in natural scene images -- and text based image retrieval.

Image Retrieval Region Proposal +2

Very Deep Convolutional Networks for Large-Scale Image Recognition

267 code implementations 4 Sep 2014 Karen Simonyan, Andrew Zisserman

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.

General Classification Image Classification

Efficient On-the-fly Category Retrieval using ConvNets and GPUs

no code implementations 17 Jul 2014 Ken Chatfield, Karen Simonyan, Andrew Zisserman

We investigate the gains in precision and speed that can be obtained by using Convolutional Networks (ConvNets) for on-the-fly retrieval - where classifiers are learnt at run time for a textual query from downloaded images, and used to rank large image or video datasets.

Binarization Quantization

Two-Stream Convolutional Networks for Action Recognition in Videos

6 code implementations NeurIPS 2014 Karen Simonyan, Andrew Zisserman

Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art.

Action Classification Action Recognition +6

Triangulation Embedding and Democratic Aggregation for Image Search

no code implementations CVPR 2014 Herve Jegou, Andrew Zisserman

We consider the design of a single vector representation for an image that embeds and aggregates a set of local patch descriptors such as SIFT.

Image Retrieval

A Compact and Discriminative Face Track Descriptor

no code implementations CVPR 2014 Omkar M. Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Our goal is to learn a compact, discriminative vector representation of a face track, suitable for the face recognition tasks of verification and classification.

Binarization Dimensionality Reduction +4

Talking Heads: Detecting Humans and Recognizing Their Interactions

no code implementations CVPR 2014 Minh Hoai, Andrew Zisserman

The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material.