no code implementations • 9 May 2022 • Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
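The entry above frames sign spotting as matching an isolated-sign query against continuous signing. As a minimal, purely illustrative sketch (the sliding-window pooling, cosine-similarity rule, and threshold below are assumptions, not the authors' method), one can decide whether and where a query sign occurs by comparing its embedding against windowed embeddings of the continuous video:

```python
import torch
import torch.nn.functional as F

def spot_sign(query_emb: torch.Tensor,
              frame_embs: torch.Tensor,
              window: int = 16,
              threshold: float = 0.7):
    """Hypothetical sign-spotting baseline.

    query_emb:  (D,) embedding of the isolated sign (assumed given).
    frame_embs: (T, D) per-frame embeddings of the continuous video.
    Returns (found, start_frame, best_score).
    """
    # Average-pool the continuous video over sliding windows of `window` frames.
    pooled = frame_embs.unfold(0, window, 1).mean(dim=-1)          # (T - window + 1, D)
    # Cosine similarity between each window and the query sign.
    scores = F.cosine_similarity(pooled, query_emb.unsqueeze(0), dim=-1)
    best_score, best_start = scores.max(dim=0)
    found = bool(best_score > threshold)
    return found, int(best_start), float(best_score)

# Toy usage with random embeddings (D=256, 200-frame continuous video).
query = torch.randn(256)
video = torch.randn(200, 256)
print(spot_sign(query, video))
```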
no code implementations • 8 Dec 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
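Audio-visual synchronisation is commonly cast as estimating the temporal offset between the two streams. The sketch below illustrates that formulation only; it assumes precomputed per-frame audio and video embeddings and is not the model benchmarked on VGG-Sound Sync:

```python
import torch
import torch.nn.functional as F

def estimate_av_offset(video_embs: torch.Tensor,
                       audio_embs: torch.Tensor,
                       max_shift: int = 15) -> int:
    """Return the shift s (in frames) maximising agreement between video frame t
    and audio frame t + s; positive s means the audio content lags the video.
    Both inputs are (T, D) embeddings assumed to come from pretrained encoders."""
    T = min(video_embs.shape[0], audio_embs.shape[0])
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        start, end = max(0, -shift), T - max(0, shift)
        if end - start < 1:
            continue
        # Mean cosine similarity over the overlapping region at this shift.
        score = F.cosine_similarity(video_embs[start:end],
                                    audio_embs[start + shift:end + shift],
                                    dim=-1).mean().item()
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy usage: the "audio" stream is a noisy copy of the video stream, delayed by 3 frames.
v = torch.randn(100, 128)
a = torch.roll(v, shifts=3, dims=0) + 0.05 * torch.randn(100, 128)
print(estimate_av_offset(v, a))   # expected to print a value close to 3
```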
no code implementations • 5 Nov 2021 • Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, Andrew Zisserman
In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL).
1 code implementation • 29 Oct 2021 • K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting.
Ranked #1 on Visual Keyword Spotting on LRS2
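As a hedged illustration of the visual keyword spotting task described in this entry (the `KeywordSpotter` module, its cross-attention design, and all dimensions below are assumptions, not the ranked model), one can score every video frame for a given keyword and localise the peak:

```python
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    """Hypothetical keyword-spotting head: cross-attend video features to a
    keyword encoding and predict a per-frame 'keyword is being said' score."""

    def __init__(self, dim: int = 256, vocab: int = 40):
        super().__init__()
        self.char_emb = nn.Embedding(vocab, dim)   # keyword characters/phonemes
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, video_feats, keyword_ids):
        # video_feats: (B, T, dim) lip features; keyword_ids: (B, L) token ids.
        kw = self.char_emb(keyword_ids)                               # (B, L, dim)
        fused, _ = self.attn(query=video_feats, key=kw, value=kw)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)     # (B, T)

# Toy usage: locate the most likely frame for a 5-token keyword in a 75-frame clip.
model = KeywordSpotter()
probs = model(torch.randn(1, 75, 256), torch.randint(0, 40, (1, 5)))
print(probs.argmax(dim=1))  # frame index with the highest keyword probability
```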
no code implementations • 14 Oct 2021 • K R Prajwal, Triantafyllos Afouras, Andrew Zisserman
To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network.
Ranked #1 on Lipreading on LRS2 (using extra training data)
Audio-Visual Active Speaker Detection • Automatic Speech Recognition • +3
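Contribution (1) of the entry above is an attention-based pooling mechanism for aggregating visual speech representations. The module below is a generic sketch of such a pooling layer, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a (B, T, D) sequence of visual speech features into one (B, D)
    vector using learned attention weights, instead of plain averaging."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(feats), dim=1)   # (B, T, 1), sums to 1 over T
        return (weights * feats).sum(dim=1)                 # (B, D)

# Toy usage: pool 100 frames of 512-d lip features into a single clip descriptor.
pool = AttentionPooling()
print(pool(torch.randn(2, 100, 512)).shape)   # torch.Size([2, 512])
```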
no code implementations • ICCV 2021 • Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, Andrew Zisserman
The goal of this work is to temporally align asynchronous subtitles in sign language videos.
no code implementations • 13 Apr 2021 • Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze
We tackle the problem of learning object detectors without supervision.
1 code implementation • CVPR 2021 • Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset.
no code implementations • CVPR 2021 • Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to support our study of sign language recognition; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.
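Contribution (2) of the entry above turns learned attention into automatic sign annotations. The function below is a loose sketch of that idea under assumed inputs (a word-by-frame attention matrix) and an illustrative confidence threshold; the paper's actual selection criteria may differ:

```python
import numpy as np

def attention_to_annotations(attn: np.ndarray, fps: float = 25.0,
                             min_confidence: float = 0.05):
    """Convert a (num_words, num_frames) attention matrix into candidate
    sign annotations: one (word_index, time_in_seconds, confidence) tuple per
    word whose attention peak is confident enough. Thresholds are illustrative."""
    annotations = []
    for word_idx, word_attn in enumerate(attn):
        frame = int(word_attn.argmax())           # frame the word attends to most
        confidence = float(word_attn[frame])
        if confidence >= min_confidence:
            annotations.append((word_idx, frame / fps, confidence))
    return annotations

# Toy usage: 3 subtitle words attended over 50 video frames.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 50))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(attention_to_annotations(attn))
```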
1 code implementation • 8 Oct 2020 • Liliane Momeni, Gül Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman
The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
1 code implementation • 2 Sep 2020 • Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman
The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio.
no code implementations • ECCV 2020 • Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.
1 code implementation • ECCV 2020 • Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman
Recent progress in fine-grained gesture and action classification, and machine translation, points to the possibility of automated sign language recognition becoming a reality.
Ranked #2 on Sign Language Recognition on WLASL-2000
no code implementations • 2 Jul 2020 • Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community.
no code implementations • 28 Nov 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data.
Ranked #10 on Lipreading on LRS2 (using extra training data)
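The entry above trains visual speech recognition without human-annotated transcripts. One way to realise this, sketched here under the assumption that a pretrained audio ASR model is available as a frozen teacher (a cross-modal distillation setup; treat it as illustrative rather than the paper's exact recipe), is to match the lip-reading student's posteriors to the teacher's:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between a frozen audio-ASR teacher's per-frame posteriors
    and a lip-reading student's, both of shape (B, T, num_classes).
    The teacher provides the supervision, so no human transcripts are needed."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: 2 clips, 50 frames, 500 sub-word classes.
student = torch.randn(2, 50, 500, requires_grad=True)
teacher = torch.randn(2, 50, 500)
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```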
no code implementations • 11 Jul 2019 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements, a representation of their voice, or both.
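A rough sketch of the kind of conditioning described above: a mask-based enhancement network whose input concatenates the noisy spectrogram with an optional lip-movement embedding and an optional voice embedding. The module, its fusion by concatenation, and all dimensions are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ConditionedEnhancer(nn.Module):
    """Predict a magnitude mask for the target speaker from a noisy spectrogram,
    conditioned on lip features and/or a voice embedding (either may be absent)."""

    def __init__(self, n_freq: int = 257, lip_dim: int = 512, voice_dim: int = 256):
        super().__init__()
        hidden = 512
        self.net = nn.Sequential(
            nn.Linear(n_freq + lip_dim + voice_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())   # per-bin mask in [0, 1]
        self.lip_dim, self.voice_dim = lip_dim, voice_dim

    def forward(self, noisy_spec, lip_feats=None, voice_emb=None):
        # noisy_spec: (B, T, n_freq); lip_feats: (B, T, lip_dim); voice_emb: (B, voice_dim)
        B, T, _ = noisy_spec.shape
        if lip_feats is None:                        # missing condition -> zeros
            lip_feats = noisy_spec.new_zeros(B, T, self.lip_dim)
        if voice_emb is None:
            voice_emb = noisy_spec.new_zeros(B, self.voice_dim)
        voice = voice_emb.unsqueeze(1).expand(B, T, self.voice_dim)
        mask = self.net(torch.cat([noisy_spec, lip_feats, voice], dim=-1))
        return mask * noisy_spec                     # enhanced magnitude spectrogram

# Toy usage: lips only (e.g. unknown speaker identity), then voice only.
model = ConditionedEnhancer()
spec = torch.rand(1, 100, 257)
print(model(spec, lip_feats=torch.randn(1, 100, 512)).shape)
print(model(spec, voice_emb=torch.randn(1, 256)).shape)
```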
2 code implementations • 6 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
Ranked #5 on Audio-Visual Speech Recognition on LRS3-TED
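As a hedged sketch of audio-visual speech recognition in general (a generic two-encoder fusion model, not the ranked system; it also assumes the two feature streams are already temporally aligned), the recogniser below encodes lips and audio separately, concatenates them, and emits per-frame character logits; dropping the audio input reduces it to lip reading:

```python
import torch
import torch.nn as nn

class AVSpeechRecognizer(nn.Module):
    """Generic audio-visual recogniser sketch: encode lips and audio separately,
    fuse by concatenation, and emit per-frame character logits (e.g. for CTC)."""

    def __init__(self, dim: int = 256, num_chars: int = 40):
        super().__init__()
        def make_encoder():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=2)
        self.video_enc, self.audio_enc = make_encoder(), make_encoder()
        self.head = nn.Linear(2 * dim, num_chars)

    def forward(self, video_feats, audio_feats=None):
        # video_feats, audio_feats: (B, T, dim); audio may be absent at test time.
        v = self.video_enc(video_feats)
        if audio_feats is None:                   # audio unavailable: pure lip reading
            a = torch.zeros_like(v)
        else:
            a = self.audio_enc(audio_feats)
        return self.head(torch.cat([v, a], dim=-1))   # (B, T, num_chars)

# Toy usage: 75 frames of 256-d features per modality.
model = AVSpeechRecognizer()
print(model(torch.randn(1, 75, 256), torch.randn(1, 75, 256)).shape)
print(model(torch.randn(1, 75, 256)).shape)           # lip reading only
```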
no code implementations • 3 Sep 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition.
Audio-Visual Speech Recognition • Visual Speech Recognition • +1
no code implementations • 15 Jun 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition.
no code implementations • 11 Apr 2018 • Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos.
6 code implementations • 24 May 2017 • Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson
COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies.
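The sentence above summarises COMA's two components. The counterfactual advantage it derives from the centralised critic can be written compactly; the function below sketches that computation for a single agent with discrete actions (names and shapes are illustrative):

```python
import torch

def counterfactual_advantage(q_values: torch.Tensor,
                             policy: torch.Tensor,
                             taken_action: int) -> torch.Tensor:
    """COMA-style advantage for one agent.

    q_values: (num_actions,) centralised critic's Q(s, (u^-a, a')) for every
              action a' of this agent, with the other agents' actions held fixed.
    policy:   (num_actions,) this agent's current policy pi(a' | tau).
    taken_action: the action the agent actually executed.

    Returns Q(s, u) minus the counterfactual baseline sum_a' pi(a') * Q(s, (u^-a, a')).
    """
    baseline = (policy * q_values).sum()
    return q_values[taken_action] - baseline

# Toy usage: 5 discrete actions.
q = torch.tensor([1.0, 0.5, 2.0, -0.3, 0.1])
pi = torch.softmax(torch.randn(5), dim=0)
print(float(counterfactual_advantage(q, pi, taken_action=2)))
```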
4 code implementations • ICML 2017 • Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, Shimon Whiteson
Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems.