AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
35 PAPERS • NO BENCHMARKS YET
WiTA (Writing in The Air) is a dataset for the challenging writing in the air (WiTA) task -- an elaborate task bridging vision and NLP. The dataset consists of five sub-datasets in two languages (Korean and English) and amounts to 209,926 video instances from 122 participants. Finger movement for WiTA is captured with RGB cameras to ensure wide accessibility and cost-efficiency.
1 PAPER • NO BENCHMARKS YET