SPEECH-COCO contains spoken captions generated with text-to-speech (TTS) synthesis, yielding 616,767 spoken captions (more than 600 hours of audio) paired with images.
Source: SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set