no code implementations • 16 Jun 2020 • Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li
Finally, the region features are aggregated into frame-level features, which are then further encoded to measure video-text similarity.
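The aggregation step can be illustrated with a minimal sketch. This excerpt does not specify the aggregation or the encoder, so the sketch assumes mean pooling over regions, a plain average over frames in place of the "further encoding", and cosine similarity as the matching score. The function names `mean_pool`, `cosine_similarity`, and `video_text_similarity` are hypothetical and not taken from the paper.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector.

    Assumption: mean pooling stands in for the paper's aggregation step.
    """
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def video_text_similarity(video_regions, text_embedding):
    """Score a video against a text embedding.

    video_regions: list of frames, each frame a list of region feature vectors.
    text_embedding: a vector with the same dimension as the region features.
    """
    # Region features -> one feature vector per frame.
    frame_features = [mean_pool(regions) for regions in video_regions]
    # Assumption: averaging over frames replaces the learned video encoder.
    video_feature = mean_pool(frame_features)
    return cosine_similarity(video_feature, text_embedding)
```

In practice, the pooling and the frame-level encoder would be learned components; the sketch only shows how data flows from regions to frames to a single similarity score.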