The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that it is not useful for processing.
We evaluate our 50% sparse model on 7 different visual noise types and achieve an overall absolute improvement of more than 2% WER compared to the dense equivalent.
Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a spectrogram, is extracted first and then passed to a vocoder to produce the raw audio.
Speech-driven animation has gained significant traction in recent years, with current methods achieving near-photorealistic results.
no code implementations • • Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen
Furthermore, when combined with large-scale pseudo-labeled audio-visual data SynthVSR yields a new state-of-the-art VSR WER of 16. 9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90, 000 hours).
Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.
Ranked #1 on Automatic Speech Recognition (ASR) on LRS3-TED
Cross-lingual self-supervised learning has been a growing research topic in the last few years.
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained.
Ranked #1 on Speech Recognition on LRS2 (using extra training data)
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements.
Our model consists of a hybrid network of convolution and transformer blocks to learn per-AU features and to model AU co-occurrences.
In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
This work focuses on the apparent emotional reaction recognition (AERR) from the video-only input, conducted in a self-supervised fashion.
In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators.
Ranked #1 on Lipreading on Lip Reading in the Wild (using extra training data)
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio.
We also investigate face clustering in egocentric videos, a fast-emerging field that has not been studied yet in works related to face clustering.
Ranked #1 on Face Clustering on EasyCom
However, these advances are usually due to the larger training sets rather than the model design.
Ranked #1 on Lipreading on GRID corpus (mixed-speech) (using extra training data)
One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression.
Ranked #2 on DeepFake Detection on FakeAVCeleb
We propose defensive tensorization, an adversarial defence technique that leverages a latent high-order factorization of the network.
We then study the effect of variety and number of age-groups used during training on generalisation to unseen age-groups and observe that an increase in the number of training age-groups tends to increase the apparent emotional facial expression recognition performance on unseen age-groups.
In this work, we describe, evaluate and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer.
Ranked #1 on Speech Enhancement on EasyCom
To evaluate our method on in-the-wild data, we also introduce a new challenging large-scale benchmark called IMDB-Clean.
Ranked #1 on Age Estimation on KANFace
The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning.
In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm.
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.
Ranked #3 on Audio-Visual Speech Recognition on LRS2
Face parsing aims to predict pixel-wise labels for facial components of a target face in an image.
Ranked #1 on Face Parsing on iBugMask
To perform efficient inference for GMM priors, we introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Extensive experiments show that this simple approach significantly surpasses the state-of-the-art in terms of generalisation to unseen manipulations and robustness to perturbations, as well as shed light on the factors responsible for its performance.
Ranked #5 on DeepFake Detection on FakeAVCeleb
In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words.
Deep generative models rely on their inductive bias to facilitate generalization, especially for problems with high dimensional data, like images.
However, our most promising lightweight models are on par with the current state-of-the-art while showing a reduction of 8. 2x and 3. 9x in terms of computational cost and number of parameters, respectively, which we hope will enable the deployment of lipreading models in practical applications.
Ranked #4 on Lipreading on Lip Reading in the Wild
This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality.
By evaluating on several age-annotated datasets in both single- and cross-database experiments, we show that the proposed method outperforms state-of-the-art algorithms for age transfer, especially in the case of age groups that lie in the tails of the label distribution.
Introducing LI mechanisms improves the convolutional filter's sensitivity to semantic object boundaries.
In this work, we investigate the demographic bias of deep learning models in face recognition, age estimation, gender recognition and kinship verification.
Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.
In addition, with a reduction of 3x in model size and complexity, we show no decrease in performance when compared to the original HourGlass network.
Ranked #2 on Pose Estimation on Leeds Sports Poses
We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
Ranked #7 on Lipreading on CAS-VSR-W1k (LRW-1000)
Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities.
Ranked #7 on Speech Emotion Recognition on CREMA-D
In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams.
Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces.
The proposed model significantly outperforms previous approaches on non-frontal views while retaining the superior performance on frontal and near frontal mouth views.
Specifically, we learn the shape prior from our dataset using VAE-GAN, and leverage the pre-trained encoder and discriminator to regularise the training of SegNet.
In this paper, we propose a deep learning approach for facial AU detection that can easily and in a fast manner adapt to a new AU or target subject by leveraging only a few labeled samples from the new task (either an AU or subject).
As deep neural networks become widely adopted for solving most problems in computer vision and audio-understanding, there are rising concerns about their potential vulnerability.
no code implementations • 10 Jul 2019 • Fabien Ringeval, Björn Schuller, Michel Valstar, NIcholas Cummins, Roddy Cowie, Leili Tavabi, Maximilian Schmitt, Sina Alisamir, Shahin Amiriparian, Eva-Maria Messner, Siyang Song, Shuo Liu, Ziping Zhao, Adria Mallol-Ragolta, Zhao Ren, Mohammad Soleymani, Maja Pantic
The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions.
To the best of our knowledge, this is the first work to use reinforcement learning for online key-frame decision in dynamic video segmentation, and also the first work on its application on face videos.
We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on handcrafted intermediate features.
To alleviate this, one approach is to apply low-rank tensor decompositions to convolution kernels in order to compress the network and reduce its number of parameters.
Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise.
This paper is on improving the training of binary neural networks in which both activations and weights are binary.
Adapting the learned classification to new domains is a hard problem due to at least three reasons: (1) the new domains and the tasks might be drastically different; (2) there might be very limited amount of annotated data on the new domain and (3) full training of a new model for each new task is prohibitive in terms of computation and memory, due to the sheer number of parameters of deep CNNs.
Big neural networks trained on large datasets have advanced the state-of-the-art for a large variety of challenging problems, improving performance by a large margin.
In this paper, we propose to fully parametrize Convolutional Neural Networks (CNNs) with a single high-order, low-rank tensor.
Ranked #35 on Pose Estimation on MPII Human Pose
In this work, we present an end-to-end visual speech recognition system based on fully-connected layers and Long-Short Memory (LSTM) networks which is suitable for small-scale datasets.
no code implementations • 9 Jan 2019 • Jean Kossaifi, Robert Walecki, Yannis Panagakis, Jie Shen, Maximilian Schmitt, Fabien Ringeval, Jing Han, Vedhas Pandit, Antoine Toisoul, Bjorn Schuller, Kam Star, Elnar Hajiyev, Maja Pantic
Natural human-computer interaction and audio-visual human behaviour sensing systems, which would achieve robust performance in-the-wild are more needed than ever as digital devices are increasingly becoming an indispensable part of our life.
Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption.
Ranked #5 on Audio-Visual Speech Recognition on LRS2
Inspired by the recent development of deep network-based methods in semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequence.
This paper presents a classifier ensemble for Facial Expression Recognition (FER) based on models derived from transfer learning.
The progress we are currently witnessing in many computer vision applications, including automatic face analysis, would not be made possible without tremendous efforts in collecting and annotating large scale visual databases.
36 state-of-the-art trackers, including facial landmark trackers, generic object trackers and trackers that we have fine-tuned or improved, are evaluated.
To the best of our knowledge, this is the first method capable of generating subject independent realistic videos directly from raw audio.
In this paper, we present an effective and unsupervised face Re-ID system which simultaneously re-identifies multiple faces for HRI.
In this framework, we treat instance-labels as temporally-dependent latent variables in an Undirected Graphical Model.
In presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
Ranked #17 on Lipreading on Lip Reading in the Wild
We show that an absolute decrease in classification rate of up to 3. 7% is observed when training and testing on normal and whispered, respectively, and vice versa.
Computational facial models that capture properties of facial cues related to aging and kinship increasingly attract the attention of the research community, enabling the development of reliable methods for age progression, age estimation, age-invariant facial characterization, and kinship verification from visual data.
4DFAB contains recordings of 180 subjects captured in four different sessions spanning over a five-year period.
Deep generative models learned through adversarial training have become increasingly popular for their ability to generate naturalistic image textures.
To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and perform classification of speech and nonlinguistic vocalisations.
To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and performs visual speech classification from multiple views and also achieves state-of-the-art performance.
The goal of this paper is to model these structures and estimate complex feature representations simultaneously by combining conditional random field (CRF) encoded AU dependencies with deep learning.
Potentially, this makes VAEs a suitable approach for learning facial features for AU intensity estimation.
We tested the proposed modified local deep neural networks approach on the LFW and Adience databases for the task of gender and age classification.
The FG 2017 Facial Expression Recognition and Analysis challenge (FERA 2017) extends FERA 2015 to the estimation of Action Units occurrence and intensity under different camera views.
Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage.
In this paper, we address the Multi-Instance-Learning (MIL) problem when bag labels are naturally represented as ordinal variables (Multi--Instance--Ordinal Regression).
In particular, we introduce GP encoders to project multiple observed features onto a latent space, while GP decoders are responsible for reconstructing the original features.
Joint modeling of the intensity of facial action units (AUs) from face images is challenging due to the large number of AUs (30+) and their intensity levels (6).
no code implementations • 5 May 2016 • Michel Valstar, Jonathan Gratch, Bjorn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Guiota Stratou, Roddy Cowie, Maja Pantic
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions.
The adaptation of the classifier is facilitated in probabilistic fashion by conditioning the target expert on multiple source experts.
We propose a novel multi-conditional latent variable model for simultaneous facial feature fusion and detection of facial action units.
The proposed method is assessed in frontal face reconstruction, face landmark localization, pose-invariant face recognition, and face verification in unconstrained conditions.
For instance, in the case of AU detection, the goal is to discriminate between the segments of an image sequence in which this AU is active or inactive.
Our model is a latent tree (LT) that represents input features of facial landmark points and FAU intensities as leaf nodes, and encodes their higher-order dependencies with latent nodes at tree levels closer to the root.
The proposed method is assessed in frontal face reconstruction (pose correction), face landmark localization, and pose-invariant face recognition and verification by conducting experiments on $6$ facial images databases.
A key problem often encountered by many learning algorithms in computer vision dealing with high dimensional data is the so called "curse of dimensionality" which arises when the available training samples are less than the input feature space dimensionality.
In this paper we introduce a new distance for robustly matching vectors of 3D rotations.
Next, to correct the fittings of a generic model, image congealing (i. e., batch image aliment) is performed by employing only the learnt orthonormal subspace.
We propose very efficient strategies to update the model and we show that is possible to automatically construct robust discriminative person and imaging condition specific models 'in-the-wild' that outperform state-of-the-art generic face alignment strategies.
To address this limitation, in this paper, we propose to jointly optimize a part-based, trained in-the-wild, flexible appearance model along with a global shape model which results in a joint translational motion model for the model parts via Gauss-Newton (GN) optimization.
We present a novel discriminative regression based approach for the Constrained Local Models (CLMs) framework, referred to as the Discriminative Response Map Fitting (DRMF) method, which shows impressive performance in the generic face fitting scenario.
The superiority of the proposed method against the state-of-the-art time alignment methods, namely the canonical time warping and the generalized time warping, is indicated by the experimental results on both synthetic and real datasets.
We present a unifying framework which reduces the construction of probabilistic component analysis techniques to a mere selection of the latent neighbourhood, thus providing an elegant and principled framework for creating novel component analysis models as well as constructing probabilistic equivalents of deterministic component analysis methods.
We propose a novel method for automatic pain intensity estimation from facial images based on the framework of kernel Conditional Ordinal Random Fields (KCORF).