2 dataset results for Videos AND Japanese

AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.

35 PAPERS • NO BENCHMARKS YET

STAIR Actions Captions

A large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in the dataset describes a video in the form of "who does what and where."

1 PAPER • NO BENCHMARKS YET

Datasets

2 dataset results for Videos AND Japanese