LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers

This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.
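As a concrete illustration of the kernelized temporal layer, below is a minimal NumPy sketch of the fixed-coefficient temporal match kernel that LAMV builds on. Each video is summarized by cosine/sine Fourier coefficients of its frame descriptors, and the similarity at any candidate temporal offset is then computed from these coefficients alone via the Fourier shift theorem. The period `T`, the number of frequencies, the descriptor dimension, and the uniform coefficients `a_i` are illustrative assumptions; in the paper the kernel coefficients are learned end-to-end with the triplet loss.

```python
import numpy as np

def tmk_encode(frames, T=128, n_freq=16):
    # frames: (time, dim) array of per-frame descriptors.
    # Returns cosine/sine Fourier coefficients, each of shape (n_freq, dim).
    t = np.arange(frames.shape[0])
    freqs = 2 * np.pi * np.arange(1, n_freq + 1) / T   # (n_freq,)
    phase = freqs[:, None] * t[None, :]                # (n_freq, time)
    return np.cos(phase) @ frames, np.sin(phase) @ frames

def tmk_score(enc_x, enc_y, deltas, T=128, a=None):
    # Score two encoded videos at every candidate offset d, using
    # K(x, y, d) = sum_i a_i^2 [ cos(w_i d) (<Cx,Cy> + <Sx,Sy>)
    #                          + sin(w_i d) (<Cx,Sy> - <Sx,Cy>) ],
    # i.e. the shift theorem applied to the per-frequency coefficients.
    cx, sx = enc_x
    cy, sy = enc_y
    n_freq = cx.shape[0]
    a = np.ones(n_freq) if a is None else a  # uniform here; learned in LAMV
    cc = (cx * cy).sum(1); ss = (sx * sy).sum(1)
    cs = (cx * sy).sum(1); sc = (sx * cy).sum(1)
    w = 2 * np.pi * np.arange(1, n_freq + 1) / T
    deltas = np.asarray(deltas)
    cos_d = np.cos(np.outer(deltas, w))  # (n_deltas, n_freq)
    sin_d = np.sin(np.outer(deltas, w))
    return cos_d @ (a**2 * (cc + ss)) + sin_d @ (a**2 * (cs - sc))

# Usage: recover a known 7-frame delay between two descriptor sequences.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 64))
y = np.vstack([rng.normal(size=(7, 64)), x])  # y is x delayed by 7 frames
deltas = np.arange(-32, 33)
scores = tmk_score(tmk_encode(x), tmk_encode(y), deltas)
print("estimated offset:", deltas[np.argmax(scores)])  # expect 7
```

Maximizing the score over candidate offsets yields both a similarity value (for retrieval and copy detection) and a temporal alignment of the two videos, which is exactly the quantity the paper's temporal layer computes and learns.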

Task            | Dataset                                           | Model | Metric                                        | Value  | Global Rank
----------------|---------------------------------------------------|-------|-----------------------------------------------|--------|------------
Video Retrieval | FIVR-200K                                         | LAMV  | mAP (ISVR)                                    | 0.371  | #17
Video Retrieval | FIVR-200K                                         | LAMV  | mAP (DSVR)                                    | 0.496  | #16
Video Retrieval | FIVR-200K                                         | LAMV  | mAP (CSVR)                                    | 0.466  | #16
Video Alignment | MSU Video Alignment and Retrieval Benchmark Suite | TMK   | Accuracy w/ 3 frames error (Light)            | 0.0571 | #4
Video Alignment | MSU Video Alignment and Retrieval Benchmark Suite | TMK   | Accuracy w/ 3 frames error (Medium geometric) | 0.0446 | #4
Video Alignment | MSU Video Alignment and Retrieval Benchmark Suite | TMK   | Accuracy w/ 3 frames error (Medium color)     | 0.0607 | #4
Video Alignment | MSU Video Alignment and Retrieval Benchmark Suite | TMK   | Accuracy w/ 3 frames error (Hard)             | 0.0554 | #3
