ARVSU contains a large collection of image variations of visual scenes, with each scenario annotated with an utterance and its corresponding addressee.
Source: Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances