no code implementations • 20 Feb 2023 • Litian Zhang, XiaoMing Zhang, Ziming Guo, Zhipeng Liu
The visual description and the text content are then fused to generate a textual summary that captures the semantics of the multimodal content, and the most relevant image is selected as the visual summary.
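The image-selection step can be illustrated with a minimal sketch. It assumes each image already has a textual description and scores every description against the textual summary, keeping the best match. The bag-of-words cosine similarity is a simple stand-in chosen for illustration, not the relevance measure used by the authors, and the helper names (`bag_of_words`, `cosine`, `select_visual_summary`) are hypothetical. The fusion and generation step is not sketched, since it would require a trained model.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_visual_summary(textual_summary, visual_descriptions):
    """Return the index of the image whose description best matches the summary.

    Assumption: relevance is approximated by word overlap between the
    summary and each image description.
    """
    if not visual_descriptions:
        raise ValueError("at least one image description is required")
    summary_vec = bag_of_words(textual_summary)
    scores = [cosine(summary_vec, bag_of_words(d)) for d in visual_descriptions]
    # Ties resolve to the earliest image.
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    descriptions = [
        "a politician speaking at a podium",
        "a flooded street with cars under water",
    ]
    summary = "Heavy rain flooded the street and left cars under water."
    print(select_visual_summary(summary, descriptions))  # prints 1
```

In a real system the word-count vectors would likely be replaced by embeddings from a pretrained encoder, while the selection logic (score each candidate, take the maximum) would stay the same.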