Image and Text fusion for UPMC Food-101 using BERT and CNNs

17 Dec 2020 · Ignazio Gallo, Gianmarco Ria, Nicola Landro, and Riccardo La Grassa

The modern digital world is increasingly multimodal. On the internet, images are often paired with text, so classification problems involving these two modalities are very common. In this paper, we examine multimodal classification using textual information and visual representations of the same concept. We investigate two basic fusion methods, early fusion and late fusion, and adapt them with stacking techniques to better handle this type of problem. We evaluate on UPMC Food-101, a difficult and noisy multimodal dataset that is representative of this class of problems. Our results show that the proposed early fusion technique, combined with a stacking-based approach, exceeds the state of the art on this dataset.
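
The two fusion strategies are standard: late fusion trains one classifier per modality and merges their predictions, while early fusion concatenates the modality features and trains a single classifier on the joint vector. Below is a minimal sketch of the early-fusion idea, assuming PyTorch, torchvision, and Hugging Face transformers; the hidden-layer size and the exact fusion head are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import inception_v3
from transformers import BertModel

class EarlyFusionClassifier(nn.Module):
    """Concatenate BERT text features with InceptionV3 image features
    and classify the fused vector with a single head (early fusion)."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.cnn = inception_v3(weights="DEFAULT")
        self.cnn.fc = nn.Identity()  # expose the 2048-d pooled image features
        # Fused classifier head; the 512-unit hidden layer is an assumption.
        self.head = nn.Sequential(
            nn.Linear(768 + 2048, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask, images):
        # images must be 299x299, as InceptionV3 expects.
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output  # (B, 768)
        img_out = self.cnn(images)
        # In training mode InceptionV3 returns a (logits, aux_logits) namedtuple.
        img_feat = img_out.logits if isinstance(img_out, tuple) else img_out  # (B, 2048)
        fused = torch.cat([text_feat, img_feat], dim=1)  # joint text+image vector
        return self.head(fused)
```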

Datasets

UPMC Food-101

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
| --- | --- | --- | --- | --- | --- |
| Document Text Classification | Food-101 | BERT | Accuracy (%) | 84.41 | #1 |
| Multimodal Text and Image Classification | Food-101 | Late Fusion (BERT + InceptionV3) | Accuracy (%) | 84.59 | #2 |
| Multimodal Text and Image Classification | Food-101 | Early Fusion (BERT + InceptionV3) | Accuracy (%) | 92.5 | #1 |
| Multi-Modal Document Classification | Food-101 | Late Fusion (BERT + InceptionV3) | 1:1 Accuracy | 84.59 | #2 |
| Multi-Modal Document Classification | Food-101 | Early Fusion (BERT + InceptionV3) | 1:1 Accuracy | 92.5 | #1 |
| Image Classification | Food-101 | InceptionV3 | Accuracy (%) | 71.67 | #6 |
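
The gap between the late-fusion (84.59%) and early-fusion (92.5%) rows is the paper's central result. For reference, here is a hedged sketch of a simple late-fusion rule that averages the class distributions of independently trained text and image classifiers; the averaging rule is one common choice assumed for illustration, not necessarily the paper's exact combination scheme.

```python
import torch

def late_fusion(text_logits: torch.Tensor, image_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-modality class distributions (a simple late-fusion rule).

    `text_logits` and `image_logits` stand in for the outputs of separately
    trained BERT and InceptionV3 classifiers (names assumed for illustration).
    """
    text_probs = torch.softmax(text_logits, dim=1)
    image_probs = torch.softmax(image_logits, dim=1)
    return (text_probs + image_probs) / 2  # fused class distribution
```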

Methods


BERT, InceptionV3, Early Fusion, Late Fusion, Stacking