Grounding Language Representation with Visual Object Information via Cross Modal Pretraining

29 Sep 2021 · Cong-Duy T Nguyen, Anh Tuan Luu, Tho Quan

Previous studies of visually grounded language learning use a convolutional neural network (CNN) to extract features from the whole image for grounding with the sentence description. However, this approach has two main drawbacks: (i) the whole image usually contains more objects and background than the sentence describes, so matching the two confuses the grounded model; (ii) the CNN extracts only image features, not the relationships between the objects in the image, which limits the grounded model's ability to learn complicated contexts. To overcome these shortcomings, we propose a novel object-level grounded language learning framework that enriches the language representation with visual object-grounded information. The framework comprises three main components: (i) ObjectGroundedBERT captures visual-object relations and their textual descriptions through cross-modal pretraining via a Text-grounding mechanism, (ii) a Visual encoder represents the visual relations between objects, and (iii) a Cross-modal Transformer helps the Visual encoder and ObjectGroundedBERT learn the alignment and representation of image-text context. Experimental results show that our proposed framework consistently outperforms baseline language models on various language tasks from the GLUE benchmark and the SQuAD dataset.
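
The sketch below is a minimal PyTorch illustration of how the three described components might be composed: a Visual encoder over per-object features, a cross-modal block in which text tokens attend to objects, and a BERT-style text encoder wrapped by the grounded model. It is not the authors' implementation; the module names (`VisualEncoder`, `CrossModalTransformer`, `ObjectGroundedLM`), the dimensions, and the assumption of pre-extracted region features from an off-the-shelf object detector are illustrative assumptions.

```python
# Illustrative sketch only; module names, dimensions, and the use of
# pre-extracted object region features are assumptions, not the paper's code.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Transformer over per-object region features, modelling object-object relations."""
    def __init__(self, obj_feat_dim=2048, hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(obj_feat_dim, hidden_dim)            # project detector features
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, obj_feats):                                  # (B, num_objects, obj_feat_dim)
        return self.encoder(self.proj(obj_feats))                  # (B, num_objects, hidden_dim)


class CrossModalTransformer(nn.Module):
    """One cross-attention block: text tokens attend over object representations."""
    def __init__(self, hidden_dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                                 nn.GELU(),
                                 nn.Linear(4 * hidden_dim, hidden_dim))
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, text_states, obj_states):
        attended, _ = self.cross_attn(text_states, obj_states, obj_states)
        x = self.norm1(text_states + attended)                     # residual + norm
        return self.norm2(x + self.ffn(x))


class ObjectGroundedLM(nn.Module):
    """Wraps a BERT-style text encoder and grounds its tokens in object features."""
    def __init__(self, text_encoder, hidden_dim=768):
        super().__init__()
        self.text_encoder = text_encoder                           # e.g., a HuggingFace BertModel
        self.visual_encoder = VisualEncoder(hidden_dim=hidden_dim)
        self.cross_modal = CrossModalTransformer(hidden_dim=hidden_dim)

    def forward(self, input_ids, attention_mask, obj_feats):
        text_states = self.text_encoder(input_ids=input_ids,
                                        attention_mask=attention_mask).last_hidden_state
        obj_states = self.visual_encoder(obj_feats)
        return self.cross_modal(text_states, obj_states)           # grounded token representations
```

In such a setup, `text_encoder` could be a pretrained BERT (for instance, `BertModel.from_pretrained("bert-base-uncased")` from the `transformers` library) and `obj_feats` the region features produced by an object detector; the cross-modal pretraining objective itself (the Text-grounding mechanism) is not shown here.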
