Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

We address the problem of phrase grounding by lear ing a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model... (read more)

PDF Abstract CVPR 2019 PDF CVPR 2019 Abstract

Results from the Paper


TASK DATASET MODEL METRIC NAME METRIC VALUE GLOBAL RANK RESULT BENCHMARK
Phrase Grounding Flickr30k COCO_ELMo_PNASNet Pointing Game Accuracy 69.19 # 1
Phrase Grounding ReferIt VG_BiLSTM_VGG Pointing Game Accuracy 62.76 # 1
Phrase Grounding Visual Genome VG_ELMo_PNASNet Pointing Game Accuracy 55.16 # 1

Methods used in the Paper


METHOD TYPE
🤖 No Methods Found Help the community by adding them if they're not listed; e.g. Deep Residual Learning for Image Recognition uses ResNet