Co-speech gestures are everywhere. People gesture when they chat with others, give a public speech, talk on the phone, and even think aloud. Despite this ubiquity, few datasets of co-speech gestures are available, mainly because recruiting actors and capturing precise body motion is expensive. The few existing datasets (e.g., MSP AVATAR [17] and Personality Dyads Corpus [18]) contain less than 3 hours of recordings each and lack diversity in speech content and speakers. The recorded gestures can also be unnatural because the performers wear cumbersome body-tracking suits and act in a laboratory setting.
Thus, we collected a new dataset of co-speech gestures: the TED Gesture Dataset. TED is a conference where people share their ideas from a stage, and recordings of these talks are available online. Using TED talks has the following advantages over existing datasets:
• Large enough to learn the mapping from speech to gestures, and the number of available videos continues to grow.
• Diverse speech content and speakers: there are thousands of unique speakers, each talking about their own ideas and stories.
• The talks are well prepared, so we expect the speakers to use proper hand gestures.
• Favorable for automating data collection and annotation: all talks come with transcripts, and the flat backgrounds and steady shots make it easier to extract human poses with computer vision techniques (a pose-extraction sketch is given below).
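To illustrate the automated annotation step mentioned in the last point, the following is a minimal sketch of per-frame pose extraction from a downloaded talk video. The choice of MediaPipe Pose, the file names, and the 10 fps sampling rate are assumptions for illustration, not the dataset's actual pipeline.

```python
# Minimal sketch of automated pose extraction from a talk video.
# MediaPipe Pose, the file paths, and the 10 fps sampling rate are
# illustrative assumptions, not the dataset's actual tooling.
import json

import cv2
import mediapipe as mp


def extract_poses(video_path, out_path, sample_fps=10):
    """Sample frames from the video and save normalized 2D body keypoints."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or sample_fps
    step = max(int(round(src_fps / sample_fps)), 1)

    poses = []
    with mp.solutions.pose.Pose(static_image_mode=False) as estimator:
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % step == 0:
                result = estimator.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.pose_landmarks:
                    # Store normalized (x, y) per landmark; co-speech gestures
                    # mostly involve the upper body, but all joints are kept here.
                    poses.append({
                        "frame": frame_idx,
                        "keypoints": [(lm.x, lm.y)
                                      for lm in result.pose_landmarks.landmark],
                    })
            frame_idx += 1
    cap.release()

    with open(out_path, "w") as f:
        json.dump(poses, f)


if __name__ == "__main__":
    extract_poses("ted_talk.mp4", "ted_talk_poses.json")
```

The extracted keypoint sequences can then be aligned with the talks' transcript timestamps to form speech-gesture pairs without manual annotation.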