Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video.
For both models, we pretrain on the complete VATEX dataset together with 90% of the TRECVID-VTT dataset, using the remaining 10% for validation.
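To illustrate this split, the following minimal Python sketch shows one way the pretraining data could be assembled; the loader names (load_vatex, load_trecvid_vtt) are hypothetical placeholders, not the actual pipeline.

```python
# Minimal sketch of the pretraining/validation split described above.
# load_vatex and load_trecvid_vtt are hypothetical placeholder loaders.
import random

def split_trecvid_vtt(samples, val_fraction=0.1, seed=42):
    """Shuffle TRECVID-VTT samples and split them 90/10 into train and validation."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

# vatex = load_vatex()                          # complete VATEX dataset
# trecvid = load_trecvid_vtt()                  # complete TRECVID-VTT dataset
# trecvid_train, trecvid_val = split_trecvid_vtt(trecvid)
# pretrain_set = vatex + trecvid_train          # pretraining data for both models
# val_set = trecvid_val                         # held-out validation data
```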
Automatic medical report generation from chest X-ray images is one way to assist doctors and reduce their workload.
Furthermore, we introduce a novel metric to assess whether the generated captions meet our requirements (i.e., contain a subject, a predicate, an object, and the product name), and we describe a series of experiments on caption quality, including how to address annotator disagreement in the image ratings with an approach called soft targets.
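To make the soft-targets idea concrete, here is a minimal sketch under the assumption that each image received one integer rating per annotator; instead of collapsing the votes into a single majority label, the training target becomes the normalized distribution of votes.

```python
# Minimal sketch of soft targets for annotator disagreement. Assumes each
# image was rated by several annotators on a discrete scale (0..num_classes-1).
from collections import Counter

def soft_target(ratings, num_classes):
    """Convert per-annotator ratings into a probability distribution over classes."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts.get(c, 0) / total for c in range(num_classes)]

# Example: three annotators rate an image 1, 1, 2 on a 3-point scale.
# A hard majority target would be class 1; the soft target preserves disagreement.
print(soft_target([1, 1, 2], num_classes=3))  # ~[0.0, 0.667, 0.333]
```

Such a distribution can then be used directly as the target of a cross-entropy loss, so the model is penalized less for predictions that agree with a minority of annotators.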
By adding the third output modality, it also considerably improves the quality of generated captions for images depicting branded products.