Are These Birds Similar: Learning Branched Networks for Fine-grained Representations
Fine-grained image classification is a challenging task because of the hierarchical coarse-to-fine-grained distribution present in the dataset. Generally, parts are used to discriminate between objects in fine-grained datasets; however, not all parts are beneficial or indispensable. In recent years, natural language descriptions have been used to obtain information on the discriminative parts of an object. This paper leverages natural language descriptions and proposes a strategy for learning a joint representation of natural language descriptions and images, using a two-branch network with multiple layers, to improve the fine-grained classification task. Extensive experiments show that our approach yields significant improvements in accuracy on the fine-grained image classification task. Furthermore, our method achieves new state-of-the-art results on the CUB-200-2011 dataset.
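The two-branch idea in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the fusion by concatenation, the single layer per branch, the feature dimensions, the random weights, and every name (`TwoBranchClassifier`, `init_layer`, and so on) are assumptions made for illustration. In the paper's setting the inputs would presumably be features from an image model such as NTS-Net and a text model such as BERT; here they are random vectors. Only the 200 output classes reflect the CUB-200-2011 dataset.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    # Subtract the maximum for numerical stability.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoBranchClassifier:
    """Illustrative two-branch model: one branch per modality, then fusion."""

    def __init__(self, image_dim, text_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.image_branch = init_layer(rng, image_dim, hidden_dim)
        self.text_branch = init_layer(rng, text_dim, hidden_dim)
        # The classifier sees both embeddings side by side (assumed fusion).
        self.classifier = init_layer(rng, 2 * hidden_dim, num_classes)

    def forward(self, image_features, text_features):
        img = relu(linear(image_features, *self.image_branch))
        txt = relu(linear(text_features, *self.text_branch))
        joint = img + txt  # list concatenation, not element-wise addition
        return softmax(linear(joint, *self.classifier))


# Toy usage with random stand-ins for the image and text features.
model = TwoBranchClassifier(image_dim=8, text_dim=6, hidden_dim=4, num_classes=200)
rng = random.Random(1)
image_features = [rng.gauss(0, 1) for _ in range(8)]
text_features = [rng.gauss(0, 1) for _ in range(6)]
probs = model.forward(image_features, text_features)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The sketch keeps each modality in its own branch until the fusion step, so either feature extractor could be swapped without touching the other. A trained model would learn the branch and classifier weights jointly rather than use random values.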
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Multi-Modal Document Classification | CUB-200-2011 | Two Branch Network (Text - BERT + Image - NTS-Net) | 1:1 Accuracy | 96.81 | # 1 |
Fine-Grained Image Classification | CUB-200-2011 | NTS-Net | Accuracy | 87.5 | # 20 |
Multimodal Text and Image Classification | CUB-200-2011 | Two Branch Network (Text - BERT + Image - NTS-Net) | Accuracy | 96.81 | # 1 |
Document Text Classification | CUB-200-2011 | BERT | Accuracy | 65.0 | # 1 |
Multimodal Deep Learning | CUB-200-2011 | Two Branch Network (Text - BERT + Image - NTS-Net) | Accuracy | 96.81 | # 1 |