no code implementations • 20 Jan 2017 • Stavros Petridis, Zuwei Li, Maja Pantic
Recently, several deep learning approaches have been presented which automatically extract features from mouth images, aiming to replace the handcrafted feature extraction stage.
no code implementations • 24 Mar 2017 • Zukang Liao, Stavros Petridis, Maja Pantic
We tested the proposed modified local deep neural networks approach on the LFW and Adience databases for the task of gender and age classification.
no code implementations • 1 Sep 2017 • Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
To the best of our knowledge, this is the first model that simultaneously learns to extract features directly from the pixels and performs visual speech classification from multiple views, while also achieving state-of-the-art performance.
no code implementations • 12 Sep 2017 • Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and performs classification of speech and nonlinguistic vocalisations.
2 code implementations • 18 Feb 2018 • Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos, Maja Pantic
In presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
Ranked #17 on Lipreading on Lip Reading in the Wild
no code implementations • 18 Feb 2018 • Stavros Petridis, Jie Shen, Doruk Cetin, Maja Pantic
We show that an absolute decrease in classification rate of up to 3.7% is observed when training on normal speech and testing on whispered speech, and vice versa.
1 code implementation • 10 Apr 2018 • Yujiang Wang, Jie Shen, Stavros Petridis, Maja Pantic
In this paper, we present an effective and unsupervised face Re-ID system which simultaneously re-identifies multiple faces for HRI.
1 code implementation • 23 May 2018 • Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
To the best of our knowledge, this is the first method capable of generating subject independent realistic videos directly from raw audio.
no code implementations • 19 Jul 2018 • Yen Khye Lim, Zukang Liao, Stavros Petridis, Maja Pantic
This paper presents a classifier ensemble for Facial Expression Recognition (FER) based on models derived from transfer learning.
no code implementations • 28 Sep 2018 • Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, Maja Pantic
Therefore, we could use a CTC loss in combination with an attention-based model in order to enforce monotonic alignments while at the same time eliminating the conditional independence assumption.
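The combination described above is commonly implemented as a weighted sum of the two training objectives. The sketch below is a minimal illustration of that idea; the function name and the weight `alpha` are assumptions for illustration, not taken from the paper:

```python
def hybrid_ctc_attention_loss(ctc_loss, attention_loss, alpha=0.2):
    """Interpolate a CTC loss (which enforces monotonic alignments) with
    an attention-based cross-entropy loss (which drops the conditional
    independence assumption).  `alpha` is a hypothetical weighting
    hyperparameter; both inputs are scalar loss values."""
    return alpha * ctc_loss + (1.0 - alpha) * attention_loss
```

In practice the two losses would be computed by a shared encoder with a CTC head and an attention decoder head; only the scalar combination is shown here.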
Ranked #5 on Audio-Visual Speech Recognition on LRS2
no code implementations • 2 Apr 2019 • Stavros Petridis, Yujiang Wang, Pingchuan Ma, Zuwei Li, Maja Pantic
In this work, we present an end-to-end visual speech recognition system based on fully-connected layers and Long Short-Term Memory (LSTM) networks which is suitable for small-scale datasets.
no code implementations • 5 Jun 2019 • Pingchuan Ma, Stavros Petridis, Maja Pantic
Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise.
no code implementations • 14 Jun 2019 • Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
Speech is a means of communication which relies on both audio and visual information.
no code implementations • 14 Jun 2019 • Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
no code implementations • 14 Nov 2019 • Shiyang Cheng, Pingchuan Ma, Georgios Tzimiropoulos, Stavros Petridis, Adrian Bulat, Jie Shen, Maja Pantic
The proposed model significantly outperforms previous approaches on non-frontal views while retaining the superior performance on frontal and near-frontal mouth views.
no code implementations • 12 Dec 2019 • Triantafyllos Kefalas, Konstantinos Vougioukas, Yannis Panagakis, Stavros Petridis, Jean Kossaifi, Maja Pantic
Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces.
no code implementations • 18 Dec 2019 • Pingchuan Ma, Stavros Petridis, Maja Pantic
In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams.
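One simple way to measure temporal correlation between the two streams is to correlate per-frame audio energy with mouth-motion magnitude. The sketch below uses those hand-picked signals as an illustrative stand-in for the features used in the paper; the function and variable names are assumptions:

```python
import numpy as np

def av_sync_score(audio_energy, mouth_motion):
    """Pearson correlation between a per-frame audio-energy track and a
    per-frame mouth-motion-magnitude track (both 1-D arrays of equal
    length).  A score near 1 suggests the streams move together; a low
    or negative score suggests temporal misalignment or tampering."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    v = (mouth_motion - mouth_motion.mean()) / (mouth_motion.std() + 1e-8)
    return float(np.mean(a * v))
```

A detector built on this idea would threshold the score, flagging clips whose audio and video tracks are insufficiently correlated.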
no code implementations • 13 Jan 2020 • Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic
Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities.
Ranked #8 on Speech Emotion Recognition on CREMA-D
2 code implementations • 23 Jan 2020 • Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic
We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
Ranked #7 on Lipreading on CAS-VSR-W1k (LRW-1000)
no code implementations • 4 May 2020 • Abhinav Shukla, Stavros Petridis, Maja Pantic
Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.
no code implementations • 8 Jul 2020 • Abhinav Shukla, Stavros Petridis, Maja Pantic
This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality.
1 code implementation • 13 Jul 2020 • Pingchuan Ma, Brais Martinez, Stavros Petridis, Maja Pantic
However, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8.2x and 3.9x in terms of computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.
Ranked #4 on Lipreading on Lip Reading in the Wild
1 code implementation • 29 Sep 2020 • Pingchuan Ma, Yujiang Wang, Jie Shen, Stavros Petridis, Maja Pantic
In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words.
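The dense connectivity in such a network means each temporal convolution receives the concatenation, along the channel axis, of the stage input and all earlier outputs. The following is a schematic NumPy sketch of that pattern, not the paper's implementation; the kernel shapes, ReLU placement, and function names are illustrative:

```python
import numpy as np

def temporal_conv(x, w):
    """Same-length 1-D convolution over time with zero padding.
    x: (in_ch, T); w: (out_ch, in_ch, K) with odd kernel width K."""
    out_ch, in_ch, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.zeros((out_ch, T))
    for t in range(T):
        # Contract over the input channels and the kernel window.
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def dense_tcn_stage(x, kernels):
    """Densely connected temporal stage: each convolution takes the
    concatenation of the stage input and all previous conv outputs
    (DenseNet-style channel growth along the time axis)."""
    feats = [x]
    for w in kernels:
        inp = np.concatenate(feats, axis=0)
        feats.append(np.maximum(temporal_conv(inp, w), 0.0))  # ReLU
    return np.concatenate(feats, axis=0)
```

Note how each successive kernel must accept more input channels (the stage input plus all earlier outputs), which is the defining property of dense connectivity.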
no code implementations • 7 Oct 2020 • Dominika Woszczyk, Stavros Petridis, David Millard
The results are compared to a speaker-adaptive (SA) model as well as speaker-dependent (SD) and multi-task learning (MTL) models.
1 code implementation • CVPR 2021 • Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Extensive experiments show that this simple approach significantly surpasses the state-of-the-art in terms of generalisation to unseen manipulations and robustness to perturbations, and sheds light on the factors responsible for its performance.
Ranked #5 on DeepFake Detection on FakeAVCeleb
3 code implementations • 12 Feb 2021 • Pingchuan Ma, Stavros Petridis, Maja Pantic
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer), which can be trained in an end-to-end manner.
Ranked #3 on Audio-Visual Speech Recognition on LRS2
1 code implementation • ICLR 2021 • Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Domain translation is the process of transforming data from one domain to another while preserving the common semantics.
no code implementations • 27 Apr 2021 • Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic
In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm.
no code implementations • 16 Jun 2021 • Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic
The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning.
no code implementations • 18 Oct 2021 • Rafael Poyiadzi, Jie Shen, Stavros Petridis, Yujiang Wang, Maja Pantic
We then study the effect of the variety and number of age groups used during training on generalisation to unseen age groups, and observe that increasing the number of training age groups tends to improve apparent emotional facial expression recognition performance on unseen age groups.
1 code implementation • CVPR 2022 • Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, Maja Pantic
One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression.
Ranked #2 on DeepFake Detection on FakeAVCeleb
2 code implementations • 26 Feb 2022 • Pingchuan Ma, Stavros Petridis, Maja Pantic
However, these advances are usually due to the larger training sets rather than the model design.
Ranked #1 on Lipreading on GRID corpus (mixed-speech) (using extra training data)
no code implementations • 24 Mar 2022 • Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Luo, Yiming Lin, Pingchuan Ma, Stavros Petridis, Maja Pantic
We also investigate face clustering in egocentric videos, a fast-emerging field that has not been studied yet in works related to face clustering.
Ranked #1 on Face Clustering on EasyCom
2 code implementations • 4 May 2022 • Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio.
1 code implementation • 3 Sep 2022 • Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic
In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators.
Ranked #1 on Lipreading on Lip Reading in the Wild (using extra training data)
no code implementations • 20 Oct 2022 • Marija Jegorova, Stavros Petridis, Maja Pantic
This work focuses on the apparent emotional reaction recognition (AERR) from the video-only input, conducted in a self-supervised fashion.
no code implementations • 3 Nov 2022 • Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic
In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
no code implementations • 20 Nov 2022 • Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements.
1 code implementation • 12 Dec 2022 • Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained.
Ranked #1 on Speech Recognition on LRS2 (using extra training data)
no code implementations • 6 Jan 2023 • Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
no code implementations • 14 Mar 2023 • Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic
Cross-lingual self-supervised learning has been a growing research topic in the last few years.
1 code implementation • 25 Mar 2023 • Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic
Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.
Ranked #1 on Automatic Speech Recognition (ASR) on LRS3-TED
no code implementations • CVPR 2023 • Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours).
no code implementations • 5 May 2023 • Yujiang Wang, Anshul Thakur, Mingzhi Dong, Pingchuan Ma, Stavros Petridis, Li Shang, Tingting Zhu, David A. Clifton
The prevalence of artificial intelligence (AI) has envisioned an era of healthcare democratisation that promises every stakeholder a new and better way of life.
1 code implementation • 15 May 2023 • Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results.
no code implementations • 10 Jul 2023 • Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis, Maja Pantic
We evaluate our 50% sparse model on 7 different visual noise types and achieve an overall absolute improvement of more than 2% WER compared to the dense equivalent.
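A 50% sparse model of this kind is typically obtained by zeroing out low-magnitude weights. The sketch below shows generic unstructured magnitude pruning to a target sparsity; it is an illustration of the general technique, not the pruning schedule used in the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of entries in a
    weight array, returning a pruned copy.  The threshold is the k-th
    smallest absolute value, where k = sparsity * number of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

In a full training pipeline the pruning mask would usually be applied iteratively, with fine-tuning between pruning steps, rather than in a single one-shot pass as shown.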
1 code implementation • 27 Oct 2023 • Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis
TorchAudio is an open-source audio and speech processing library built for PyTorch.
no code implementations • 17 Jan 2024 • Yufeng Yin, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Stavros Petridis, Yu-Hsiang Wu, Christi Miller
In this work, we build on this idea and introduce the problem of detecting hearing loss from an individual's facial expressions during a conversation.
1 code implementation • 2 Apr 2024 • Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic
In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data.