Bi-Directional Relationship Inferring Network for Referring Image Segmentation
Most existing methods do not explicitly model the mutual guidance between vision and language. In this work, we propose a bi-directional relationship inferring network (BRINet) to model the dependencies between cross-modal information. Specifically, vision-guided linguistic attention is used to learn the adaptive linguistic context corresponding to each visual region. Combined with language-guided visual attention, it forms a bi-directional cross-modal attention module (BCAM) that learns the relationship between multi-modal features. The final semantic context of the target object and referring expression can thus be represented accurately and consistently. Moreover, a gated bi-directional fusion module (GBFM) is designed to integrate multi-level features, where a gate function guides the bi-directional flow of multi-level information. Extensive experiments on four benchmark datasets demonstrate that the proposed method outperforms other state-of-the-art methods under different evaluation metrics.
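To make the two modules concrete, below is a minimal PyTorch sketch of the bi-directional attention and gated fusion ideas described in the abstract. The class names, feature dimensions, gate design, and the exact attention formulation are illustrative assumptions (the abstract does not specify them), not the authors' implementation.

```python
# Illustrative sketch only: names, shapes, and formulations are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiDirectionalCrossModalAttention(nn.Module):
    """Hypothetical BCAM: vision-guided linguistic attention followed by
    language-guided visual attention over the attended linguistic context."""

    def __init__(self, vis_dim, lang_dim, key_dim):
        super().__init__()
        self.vis_query = nn.Linear(vis_dim, key_dim)   # visual regions -> queries
        self.lang_key = nn.Linear(lang_dim, key_dim)   # words -> keys
        self.lang_val = nn.Linear(lang_dim, vis_dim)   # words -> values
        self.fuse = nn.Linear(2 * vis_dim, vis_dim)    # merge both directions

    def forward(self, vis_feat, lang_feat):
        # vis_feat:  (B, N, vis_dim),  N = H*W flattened visual regions
        # lang_feat: (B, T, lang_dim), T = number of words
        q = self.vis_query(vis_feat)                   # (B, N, key_dim)
        k = self.lang_key(lang_feat)                   # (B, T, key_dim)
        v = self.lang_val(lang_feat)                   # (B, T, vis_dim)

        # Vision-guided linguistic attention: each region attends over the
        # words, yielding an adaptive linguistic context per region.
        attn_v2l = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        lang_ctx = attn_v2l @ v                        # (B, N, vis_dim)

        # Language-guided visual attention: the per-region linguistic context
        # re-weights spatial positions to highlight the referred object.
        attn_l2v = F.softmax((lang_ctx * vis_feat).sum(-1, keepdim=True), dim=1)
        vis_ctx = attn_l2v * vis_feat                  # (B, N, vis_dim)

        return self.fuse(torch.cat([lang_ctx, vis_ctx], dim=-1))


class GatedBiDirectionalFusion(nn.Module):
    """Hypothetical GBFM: learned sigmoid gates control how much information
    flows between a lower-level and a higher-level feature map, in both
    directions."""

    def __init__(self, dim):
        super().__init__()
        self.gate_low = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_high = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, low, high):
        # low, high: (B, N, dim) features from two levels, same resolution
        low_out = low + self.gate_low(high) * high     # high -> low flow
        high_out = high + self.gate_high(low) * low    # low -> high flow
        return low_out, high_out
```

In this sketch, a forward pass of `BiDirectionalCrossModalAttention` on visual features of shape (B, H*W, C_v) and word features of shape (B, T, C_l) returns fused per-region features that a segmentation head could decode into a mask, while `GatedBiDirectionalFusion` exchanges information between two feature levels under gate control.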
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Referring Expression Segmentation | RefCOCO val | BRINet | Overall IoU | 61.35 | # 19
Referring Expression Segmentation | RefCOCO testA | BRINet | Overall IoU | 63.37 | # 18
Referring Expression Segmentation | RefCOCO testB | BRINet | Overall IoU | 59.57 | # 15
Referring Expression Segmentation | RefCOCO+ val | BRINet | Overall IoU | 48.57 | # 18
Referring Expression Segmentation | RefCOCO+ testA | BRINet | Overall IoU | 52.87 | # 16
Referring Expression Segmentation | RefCOCO+ testB | BRINet | Overall IoU | 42.13 | # 16