no code implementations • 10 Nov 2021 • Zijian Gao, Jingyu Liu, Weiqi Sun, Sheng Chen, Dedan Chang, Lili Zhao
Modern video-text retrieval frameworks consist of three parts: a video encoder, a text encoder, and a similarity head.
Ranked #12 on Video Retrieval on MSR-VTT-1kA (using extra training data)
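
The three-part structure named in the abstract (video encoder, text encoder, similarity head) can be illustrated with a minimal sketch. This is not the authors' model: the function names `encode_video`, `encode_text`, `similarity_head`, and `rank_videos` are hypothetical. Mean pooling stands in for the learned encoders and cosine similarity stands in for the similarity head; both are placeholder assumptions chosen only to show how the parts connect.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(features: Sequence[Sequence[float]]) -> Vector:
    """Average a sequence of equal-length feature vectors into one vector."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def encode_video(frame_features: Sequence[Sequence[float]]) -> Vector:
    # Placeholder video encoder (assumption): mean-pool per-frame features.
    return l2_normalize(mean_pool(frame_features))


def encode_text(token_features: Sequence[Sequence[float]]) -> Vector:
    # Placeholder text encoder (assumption): mean-pool per-token features.
    return l2_normalize(mean_pool(token_features))


def similarity_head(video_vec: Vector, text_vec: Vector) -> float:
    # Placeholder similarity head (assumption): cosine similarity,
    # i.e. the dot product of two unit-length vectors.
    return sum(a * b for a, b in zip(video_vec, text_vec))


def rank_videos(text_vec: Vector, video_vecs: Sequence[Vector]) -> List[int]:
    """Return video indices sorted from most to least similar to the query."""
    scores = [similarity_head(v, text_vec) for v in video_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy usage: two videos with 2-dimensional frame features and one text query.
videos = [
    encode_video([[1.0, 0.0], [1.0, 0.0]]),
    encode_video([[0.0, 1.0], [0.0, 1.0]]),
]
query = encode_text([[0.0, 1.0], [0.0, 2.0]])
ranking = rank_videos(query, videos)
```

In this toy setup the query points along the second axis, so the second video ranks first. A real system would replace the pooling functions with trained networks and might use a more expressive similarity head than a dot product.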