The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
Paper | Code | Results | Date | Stars |
---|