12 dataset results for face recog AND Audio

UDIVA is a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different behavior elicitation and cognitive workload.

6 PAPERS • NO BENCHMARKS YET

VoxCeleb2

…The dataset is audio-visual, so is also useful for a number of other applications, for example – visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa and training face recognition from video to complement existing face recognition datasets.

496 PAPERS • 5 BENCHMARKS

AVSpeech

…The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.

35 PAPERS • NO BENCHMARKS YET

RESD (Russian Emotional Speech Dialogs with annotated text)

…Аментес, Илья Лубенец, Никита Давидчук}, title = {Open artificial intelligence library for analyzing and detecting emotional nuances in human speech}, year = {2022}, publisher = {Hugging Face}, journal = {Hugging Face Hub}, howpublished = {\url{https://huggingface.com/aniemore/Aniemore}}, email = {hello@socialcode.ru} }

0 PAPERS • NO BENCHMARKS YET

MHRI dataset (Multimodal Human-Robot Interaction dataset)

…covers the user manipulation and interaction with the robot. An RGB-D camera mounted on top of the robot provides a top view of the whole scenario. An HD RGB camera points at the user's head to capture face

0 PAPERS • NO BENCHMARKS YET

EasyCom

…contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face

15 PAPERS • 4 BENCHMARKS

Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 (BIWI 3D)

…The dense dynamic face scans were acquired at 25 frames per second and the RMS error in the 3D reconstruction is about 0.5 mm.

5 PAPERS • 1 BENCHMARK

GLips (German Lips)

The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline

5 PAPERS • NO BENCHMARKS YET

ITALIC

…The dataset is available on Zenodo and connectors are available for the HuggingFace Hub.

2 PAPERS • NO BENCHMARKS YET

BAVL (Blind Audio-Visual Localization)

The Blind Audio-Visual Localization (BAVL) dataset consists of 20 audio-visual recordings of sound sources, which could be talking faces or musical instruments.

0 PAPERS • NO BENCHMARKS YET

Voice Conversion Challenge 2018

…objective of the 2016 challenge was to better understand different VC techniques built on a freely-available common dataset to look at a common goal, and to share views about unsolved problems and challenges faced

3 PAPERS • NO BENCHMARKS YET

nEMO

…The dataset is available on Hugging Face and GitHub. Data Fields file_id - filename, i.e.

1 PAPER • NO BENCHMARKS YET
