A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

17 May 2020 · Vladimir Iashin, Esa Rahtu

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track...

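Task                                 Dataset               Model  Metric             Value  Global rank
Temporal Action Proposal Generation  ActivityNet Captions  BMT    Average Precision  48.23  #1
Temporal Action Proposal Generation  ActivityNet Captions  BMT    Average Recall     80.31  #1
Temporal Action Proposal Generation  ActivityNet Captions  BMT    Average F1         60.27  #1
Dense Video Captioning               ActivityNet Captions  BMT    METEOR             8.44   #1
Dense Video Captioning               ActivityNet Captions  BMT    BLEU-3             3.84   #1
Dense Video Captioning               ActivityNet Captions  BMT    BLEU-4             1.88   #1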
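
As a quick sanity check on the Temporal Action Proposal Generation figures, the sketch below assumes that the reported Average F1 is the harmonic mean of the reported Average Precision and Average Recall. The listing does not state how F1 was computed, so this is an assumption, and the helper name `f1_score` is chosen here purely for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the results listing (in percent).
avg_precision = 48.23
avg_recall = 80.31

# Assumption: F1 is derived from the two averaged values above.
f1 = f1_score(avg_precision, avg_recall)
print(round(f1, 2))  # prints 60.27
```

Under that assumption the result agrees with the listed Average F1 of 60.27, which suggests the three proposal metrics are mutually consistent.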

Methods used in the Paper