no code implementations • 30 Mar 2022 • Guang Feng, Lihe Zhang, Zhiwei Hu, Huchuan Lu
To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features. A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote hierarchical, progressive fusion of the multi-modal features.
Ranked #3 on Referring Expression Segmentation on J-HMDB
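The mutual-guidance idea above can be sketched in plain NumPy: each modality attends over the other and folds the result back in as a residual. This is a minimal illustration, not the paper's VLMG module; the function names, the shared 8-dimensional feature space, and the single-head dot-product attention are all assumptions for the sketch (the actual module is a learned neural block inserted at several encoder stages).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, key_value):
    """Each query position attends over all positions of the other modality."""
    scores = query @ key_value.T / np.sqrt(query.shape[-1])  # (Nq, Nk)
    return softmax(scores, axis=-1) @ key_value              # (Nq, d)

def mutual_guidance(visual, linguistic):
    """One mutual-guidance step: each modality is refined by attending
    to the other, then added back as a residual connection."""
    vis_refined = visual + cross_attend(visual, linguistic)
    lang_refined = linguistic + cross_attend(linguistic, visual)
    return vis_refined, lang_refined

# Toy features, assumed already projected into a shared 8-dim space.
rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 8))     # 4x4 spatial grid, flattened
linguistic = rng.standard_normal((5, 8))  # 5 word tokens
v, l = mutual_guidance(visual, linguistic)
print(v.shape, l.shape)  # shapes are preserved by the residual update
```

Repeating this step at several encoder depths gives the hierarchical, progressive fusion the snippet describes: shallow stages exchange coarse cues, deeper stages exchange increasingly semantic ones.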
no code implementations • CVPR 2021 • Guang Feng, Zhiwei Hu, Lihe Zhang, Huchuan Lu
In this work, we propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network, and uses language to refine the multi-modal features progressively.
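The encoder-fusion idea can likewise be sketched as language-conditioned gating applied after every visual encoder stage, so the visual stream itself carries multi-modal features. This is a hypothetical NumPy sketch, not the paper's EFN: the per-stage projection matrices, the sentence vector, and the sigmoid channel gate are assumptions standing in for learned layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def language_gate(features, sentence_vec, proj):
    """Modulate per-channel responses of one stage's features with a
    language-conditioned gate in (0, 1).
    features: (N, C) flattened spatial features; proj: (C, d)."""
    gate = sigmoid(proj @ sentence_vec)  # (C,)
    return features * gate               # broadcast over spatial positions

def encoder_fusion(visual_stages, sentence_vec, projections):
    """Apply language gating after every encoder stage, so each later
    stage consumes features already refined by the expression."""
    return [language_gate(f, sentence_vec, W)
            for f, W in zip(visual_stages, projections)]

# Toy two-stage encoder output: channel width grows, resolution shrinks.
rng = np.random.default_rng(1)
stages = [rng.standard_normal((64, 32)),   # stage 1: 8x8 grid, 32 channels
          rng.standard_normal((16, 64))]   # stage 2: 4x4 grid, 64 channels
sentence = rng.standard_normal(12)         # pooled expression embedding
projs = [rng.standard_normal((32, 12)), rng.standard_normal((64, 12))]
fused = encoder_fusion(stages, sentence, projs)
print([f.shape for f in fused])
```

In the real network each gated stage feeds the next convolutional stage, which is what makes the refinement progressive rather than a one-shot late fusion.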