A Joint Sequence Fusion Model for Video Question Answering and Retrieval

ECCV 2018 • Youngjae Yu, Jongseok Kim, Gunhee Kim

We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components...

Results from the Paper


TASK             DATASET  MODEL     METRIC NAME                METRIC VALUE  GLOBAL RANK
Video Retrieval  LSMDC    JSFusion  text-to-video R@1          9.1           # 3
Video Retrieval  LSMDC    JSFusion  text-to-video R@5          21.2          # 3
Video Retrieval  LSMDC    JSFusion  text-to-video R@10         34.1          # 3
Video Retrieval  LSMDC    JSFusion  text-to-video Median Rank  36            # 3
Video Retrieval  MSR-VTT  JSFusion  text-to-video R@1          10.2          # 2
Video Retrieval  MSR-VTT  JSFusion  text-to-video R@10         43.2          # 2
Video Retrieval  MSR-VTT  JSFusion  text-to-video Median Rank  13            # 2
Video Retrieval  MSR-VTT  JSFusion  video-to-text R@5          31.2          # 4
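
For readers unfamiliar with the retrieval metrics in these results, the sketch below shows one common way to compute recall at K and median rank from a query-by-candidate similarity matrix. It is only an illustration under stated assumptions, not the paper's evaluation code: the helper names `ranks_from_similarity` and `recall_at_k` are hypothetical, each query is assumed to have exactly one correct item located on the diagonal, ties are resolved in the query's favor, and the similarity values are made up.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct item for each query.

    Assumption: sim[i][j] is the similarity of query i to candidate j,
    and the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct item appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy similarity matrix: 4 text queries (rows) x 4 videos (columns).
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct video ranked 1st
    [0.8, 0.5, 0.6, 0.1],  # correct video ranked 3rd
    [0.2, 0.4, 0.7, 0.1],  # correct video ranked 1st
    [0.9, 0.8, 0.7, 0.6],  # correct video ranked 4th
]

ranks = ranks_from_similarity(sim)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@3:", recall_at_k(ranks, 3))
print("Median Rank:", median(ranks))
```

Higher recall at K is better, while a lower median rank is better, which is why the two kinds of numbers in the results move in opposite directions as a model improves.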