Image and Text fusion for UPMC Food-101 using BERT and CNNs
The modern digital world is increasingly multimodal: on the internet, images are often paired with text, so classification problems involving these two modalities are very common. In this paper, we examine multimodal classification using textual information and visual representations of the same concept. We investigate two basic methods for multimodal fusion and adapt them with stacking techniques to better handle this type of problem. We use UPMC Food-101, a difficult and noisy multimodal dataset that is representative of this class of problems. Our results show that the proposed early fusion technique, combined with a stacking-based approach, exceeds the state of the art on this dataset.
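The two fusion strategies compared here differ in where the modalities are combined. A minimal sketch, assuming precomputed feature vectors of the standard sizes (768-d pooled output for bert-base, 2048-d pooled features for InceptionV3) and random stand-in weights; the paper's actual classifiers and stacking pipeline are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def early_fusion(text_feat, image_feat):
    """Early fusion: concatenate modality features into one joint vector,
    then train a single classifier on the combined representation."""
    return np.concatenate([text_feat, image_feat], axis=-1)

def late_fusion(text_logits, image_logits):
    """Late fusion: combine the per-modality class scores; averaging is
    one simple combination rule (an assumption, not the paper's exact one)."""
    return (text_logits + image_logits) / 2.0

num_classes = 101  # Food-101 categories

# Stand-in features for one sample (in practice, outputs of the encoders).
text_feat = rng.standard_normal(768)    # BERT pooled output
image_feat = rng.standard_normal(2048)  # InceptionV3 pooled features

# Early fusion: one linear classifier over the concatenated 2816-d vector.
joint = early_fusion(text_feat, image_feat)
W = rng.standard_normal((joint.shape[-1], num_classes)) * 0.01
early_logits = joint @ W

# Late fusion: independent per-modality classifiers, merged at score level.
Wt = rng.standard_normal((768, num_classes)) * 0.01
Wi = rng.standard_normal((2048, num_classes)) * 0.01
late_logits = late_fusion(text_feat @ Wt, image_feat @ Wi)

print(joint.shape, early_logits.shape, late_logits.shape)
```

Early fusion lets the classifier model cross-modal interactions directly, which is consistent with the large accuracy gap (92.5% vs 84.59%) reported in the results table below.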
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Document Text Classification | Food-101 | BERT | Accuracy (%) | 84.41 | #1
Multimodal Text and Image Classification | Food-101 | Late Fusion (BERT + InceptionV3) | Accuracy (%) | 84.59 | #2
Multimodal Text and Image Classification | Food-101 | Early Fusion (BERT + InceptionV3) | Accuracy (%) | 92.5 | #1
Multi-Modal Document Classification | Food-101 | Late Fusion (BERT + InceptionV3) | 1:1 Accuracy | 84.59 | #2
Multi-Modal Document Classification | Food-101 | Early Fusion (BERT + InceptionV3) | 1:1 Accuracy | 92.5 | #1
Image Classification | Food-101 | InceptionV3 | Accuracy (%) | 71.67 | #6