Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds.
Action spotting in soccer videos is the task of identifying the specific time when a certain key action of the game occurs.
Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end.
Orientation is a crucial skill for football players and a differentiating factor in many game events, especially those involving passes.
In a soccer game, detecting and tracking players provides crucial cues for further analyzing and understanding tactical aspects of the game, including individual and team actions.
In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task.
Given a monocular video of a soccer match, this paper presents a computational model to estimate the most feasible pass at any given time.
In music source separation, the number of sources may vary for each piece and some of the sources may belong to the same family of instruments, thus sharing timbral characteristics and making the sources more correlated.
Both acoustic and visual information influence human perception of speech.
However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models.
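The control mechanism described above can be sketched as feature-wise linear modulation (FiLM): a small control network maps a one-hot source label to per-channel scale and shift parameters that are applied to the U-Net's intermediate feature maps. The following minimal NumPy sketch is illustrative only; the array shapes and the single linear control layer are assumptions for the example, not the exact C-U-Net architecture:

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM-style conditioning: scale and shift each feature channel.

    features: (channels, time, freq) feature maps from a U-Net layer
    gamma, beta: (channels,) conditioning parameters produced by the
    control network from a one-hot source label.
    """
    return gamma[:, None, None] * features + beta[:, None, None]

# A toy control network: a linear map from the one-hot condition
# vector to per-channel gamma and beta parameters.
rng = np.random.default_rng(0)
n_channels, n_sources = 4, 3
W_gamma = rng.normal(size=(n_channels, n_sources))
W_beta = rng.normal(size=(n_channels, n_sources))

condition = np.zeros(n_sources)
condition[1] = 1.0  # select the second source, e.g. "vocals"

gamma = W_gamma @ condition
beta = W_beta @ condition

x = rng.normal(size=(n_channels, 8, 8))  # toy feature maps
y = film(x, gamma, beta)
print(y.shape)  # (4, 8, 8)
```

Changing the one-hot `condition` vector re-modulates the same shared network toward a different source, which is what lets a single model stand in for several dedicated separators.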
Although orientation has proven to be key to soccer players' success in a broad spectrum of plays, body orientation remains a little-explored area in sports analytics research.
The presented system could be used for data gathering, enabling the a posteriori extraction of useful statistics and semantic analyses.
We also propose a technique for measuring the similarity between activation maps and audio features, which are typically represented as matrices, such as chromagrams or spectrograms.
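One simple way to compare two such matrices, sketched here as an illustration rather than the proposed technique itself, is cosine similarity over their flattened entries (assuming both matrices have already been brought to the same shape, e.g. by resampling along the time axis):

```python
import numpy as np

def matrix_cosine_similarity(A, B):
    """Cosine similarity between two equally-shaped matrices,
    e.g. a network activation map and a chromagram, computed
    over their flattened entries. Returns a value in [-1, 1]."""
    a, b = A.ravel(), B.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy example: an activation map that is a scaled copy of the
# chromagram should score (numerically close to) 1.0.
chroma = np.abs(np.random.default_rng(1).normal(size=(12, 100)))
activation = 2.0 * chroma
print(matrix_cosine_similarity(activation, chroma))
```

Scale invariance is the point of using the cosine here: the activation map and the audio feature live on different numeric ranges, and only the pattern of energy across time and bins should matter.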
Tracking sports players is a highly challenging task, especially in single-feed videos recorded on tight courts, where clutter and occlusions cannot be avoided.
Can we perform end-to-end music source separation with a variable number of sources using a deep learning model?
We propose a large displacement optical flow method that introduces a new strategy to compute a good local minimum of any optical flow energy functional.
This paper presents a computational model to recover the most likely interpretation of the 3D scene structure from a planar image, where some objects may occlude others.