Previous research has shown that the traditional metrics used to optimize and assess models that generate lip motion from speech are poor indicators of subjective opinions of animation quality.
Federated learning enables the deployment of machine learning to problems for which centralized data collection is impractical.
One such high-impact domain is face recognition, whose real-world applications involve images affected by various degradations, such as motion blur or overexposure.
Machine learning models are trained to minimize the mean loss for a single metric, and thus typically do not consider fairness and robustness.
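This distinction can be made concrete with a small sketch: the standard objective averages the loss over all examples, whereas a robustness-oriented alternative (one common choice for illustration, not necessarily the formulation studied here) tracks the worst per-group mean loss. The function names and the example numbers below are hypothetical.

```python
import numpy as np

def mean_loss(losses):
    """Standard objective: the average loss over all examples."""
    return float(np.mean(losses))

def worst_group_loss(losses, groups):
    """A robustness-oriented alternative (illustrative, not from the source):
    the maximum over the per-group mean losses."""
    return max(float(np.mean(losses[groups == g])) for g in np.unique(groups))

# Hypothetical per-example losses and group labels.
losses = np.array([0.1, 0.2, 0.1, 0.9, 1.1])
groups = np.array([0, 0, 0, 1, 1])

print(mean_loss(losses))                 # the average hides the weak group
print(worst_group_loss(losses, groups))  # the worst group surfaces it
```

Optimizing the mean can leave a minority group with high loss; the worst-group view makes that failure visible.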
Images generated with MorphGAN preserve the identity of the person in the original image, and the control they provide over head pose and facial expression allows test sets to be created that probe the robustness of a deep face recognition network with respect to pose and expression.
We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout.
We show that expert units are important in several ways: (1) The presence of expert units is correlated ($r^2=0.833$) with the generalization power of TMs, which allows TMs to be ranked without requiring fine-tuning on suites of downstream tasks.
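For illustration, the coefficient of determination behind such a ranking can be computed from a simple linear fit. The expert-unit counts and accuracies below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: expert-unit counts per model and the
# corresponding downstream accuracies (illustrative only).
expert_units = np.array([120.0, 180.0, 240.0, 300.0])
accuracy     = np.array([0.61, 0.68, 0.74, 0.80])
print(round(r_squared(expert_units, accuracy), 3))
```

A high $r^2$ between the two quantities is what would justify using expert-unit counts as a cheap proxy for downstream performance.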
We conclude that visual speech synthesis can significantly benefit from the powerful representation of speech in the ASR acoustic models.
Principal Filter Analysis (PFA) is an easy-to-implement yet effective method for neural network compression.
We describe experiments towards building a conversational digital assistant that considers the preferred conversational style of the user.
We model the individual behavior of each agent in an interaction and then use a multi-agent fusion model to generate a summary over the group's expected actions, rendering the model independent of the number of agents.
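One simple way to make a fusion model independent of the number of agents is order-invariant pooling over per-agent representations. The mean-pooling sketch below illustrates that general idea under an assumed fixed feature dimension; it is not the paper's exact architecture:

```python
import numpy as np

def fuse_agents(agent_features):
    """Order-invariant fusion: average per-agent feature vectors so the
    summary has a fixed size regardless of how many agents are present.
    (A sketch of the general idea, not the source paper's model.)"""
    return np.mean(np.stack(agent_features), axis=0)

# Works for any number of agents; the output dimension stays the same.
two_agents   = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
three_agents = two_agents + [np.array([1.0, 1.0])]
print(fuse_agents(two_agents))    # shape (2,)
print(fuse_agents(three_agents))  # still shape (2,)
```

Because the mean is symmetric in its inputs, adding or removing agents changes the values but never the shape of the fused summary.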
We propose two algorithms. The first lets users target compression to a specific network property, such as the number of trainable variables (footprint), and produces a compressed model that satisfies the requested property while preserving the maximum amount of spectral energy in the responses of each layer. The second is a parameter-free heuristic that selects the compression applied at each layer by trying to mimic an ideal set of uncorrelated responses.
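A minimal sketch of the energy-based selection behind the first algorithm, assuming PCA on a layer's responses: keep enough principal components to retain a chosen fraction of the spectral energy. The function name and the 95% threshold are illustrative, not from the source.

```python
import numpy as np

def pfa_keep_count(responses, energy=0.95):
    """Given layer responses of shape (num_samples, num_units), return how
    many units to keep so that the retained principal components preserve
    at least `energy` of the total spectral energy. Illustrative sketch of
    an energy-based selection; the threshold is an assumption."""
    centered = responses - responses.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(cumulative, energy) + 1)

# Correlated responses: a few components carry most of the energy.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 2))
responses = base @ rng.normal(size=(2, 8))  # rank-2 responses across 8 units
print(pfa_keep_count(responses))            # at most 2 for rank-2 responses
```

Highly correlated layer responses concentrate energy in a few eigenvalues, so many units can be removed with little loss; uncorrelated responses spread energy evenly and leave little room for compression.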