TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Temporal Action Proposal Generation	ActivityNet Captions	BMT	Average Precision	48.23	# 1
Temporal Action Proposal Generation	ActivityNet Captions	BMT	Average Recall	80.31	# 1
Temporal Action Proposal Generation	ActivityNet Captions	BMT	Average F1	60.27	# 1
Dense Video Captioning	ActivityNet Captions	BMT	METEOR	8.44	# 9
Dense Video Captioning	ActivityNet Captions	BMT	BLEU-3	3.84	# 2
Dense Video Captioning	ActivityNet Captions	BMT	BLEU-4	1.88	# 4
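
As a quick consistency check on the proposal-generation rows, the reported Average F1 agrees with the harmonic mean of the reported Average Precision and Average Recall. The sketch below assumes the standard definition F1 = 2PR / (P + R) applied directly to the two averaged values. How the benchmark actually aggregates F1 is not stated on this page, and the function name `f1_from_precision_recall` is purely illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above (percentages).
avg_precision = 48.23
avg_recall = 80.31

# Assumption: F1 is computed from the two averaged values, not per video.
f1 = f1_from_precision_recall(avg_precision, avg_recall)
print(round(f1, 2))  # prints 60.27
```

Under that assumption the computed value rounds to the 60.27 listed in the table.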

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-better-use-of-audio-visual-cues-dense-video/temporal-action-proposal-generation-on-1)](https://paperswithcode.com/sota/temporal-action-proposal-generation-on-1?p=a-better-use-of-audio-visual-cues-dense-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/a-better-use-of-audio-visual-cues-dense-video/dense-video-captioning-on-activitynet)](https://paperswithcode.com/sota/dense-video-captioning-on-activitynet?p=a-better-use-of-audio-visual-cues-dense-video)`

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

17 May 2020 · Vladimir Iashin, Esa Rahtu

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance on a dataset with a specific domain. In this paper, we introduce Bi-modal Transformer which generalizes the Transformer architecture for a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task. We also show that the pre-trained bi-modal encoder as a part of the bi-modal transformer can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on a challenging ActivityNet Captions dataset where our model achieves outstanding performance. The code is available: v-iashin.github.io/bmt
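
To make the idea of a Transformer that takes bi-modal input more concrete, here is a minimal sketch of one plausible building block: scaled dot-product attention in which each modality's sequence attends over the other's. It is not the authors' implementation. It assumes both streams are already projected to a shared feature size, and it omits learned projections, multiple heads, residual connections, and layer normalization. The names `cross_attention` and `random_sequence`, the sequence lengths, and the feature size are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    dim_v = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim_v)]
        for row in weights
    ]
    return outputs, weights


def random_sequence(length, dim, rng):
    """Stand-in for real per-step features (hypothetical random data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
dim = 8                                # shared feature size (assumption)
audio = random_sequence(5, dim, rng)   # 5 audio steps
visual = random_sequence(7, dim, rng)  # 7 visual steps

# Audio queries attend over visual features, and vice versa.
audio_attended, a2v_weights = cross_attention(audio, visual, visual)
visual_attended, v2a_weights = cross_attention(visual, audio, audio)

print(len(audio_attended), len(audio_attended[0]))    # prints 5 8
print(len(visual_attended), len(visual_attended[0]))  # prints 7 8
```

Each output sequence keeps the length of its own modality while mixing in information from the other, which is why the two streams can have different numbers of time steps.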

Code

v-iashin/BMT official

220

v-iashin/video_features

432

Tasks

Dense Video Captioning

Temporal Action Proposal Generation

Video Captioning

Datasets

ActivityNet Captions

Results from the Paper

Ranked #1 on Temporal Action Proposal Generation on ActivityNet Captions

Methods

Absolute Position Encodings • Adam • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
