Lightweight Recurrent Cross-modal Encoder for Video Question Answering

Video question answering (VideoQA) essentially boils down to fusing information from text and video effectively to predict an answer. Most works employ a transformer encoder as a cross-modal encoder, fusing the two modalities with full self-attention. Because of the high computational cost of self-attention and the high dimensionality of video data, they must settle for either 1) training the cross-modal encoder only on offline-extracted video and text features, or 2) training the cross-modal encoder jointly with the video and text feature extractors, but only on sparsely sampled video frames. Training only on offline-extracted features suffers from a disconnect between the extracted features and the data of the downstream task, because the video and text feature extractors are trained independently on different domains, e.g., action recognition for the video extractor and semantic classification for the text extractor. Training on sparsely sampled frames risks information loss when the video is information-rich or has many frames. To alleviate these issues, we propose the Lightweight Recurrent Cross-modal Encoder (LRCE), which replaces the self-attention operation with a single learnable special token that summarizes the text and video features. As a result, our model incurs a significantly lower computational cost. Additionally, we introduce a novel multi-segment sampling scheme that sparsely samples video frames from different segments of the video to provide more fine-grained information. Through extensive experiments on three VideoQA datasets, we demonstrate that LRCE achieves significant performance gains over previous works.
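The two ideas in the abstract can be sketched in a few lines of numpy. This is a hedged illustration, not the paper's implementation: `token_summarize` shows how a single learnable query token can pool a sequence of features with O(N·d) cost instead of the O(N²·d) of full self-attention, and `multi_segment_sample` shows one plausible way to draw one frame per video segment. All function and parameter names here are hypothetical.

```python
import numpy as np

def token_summarize(features, query):
    """Pool a feature sequence with one learnable summary token.

    features: (N, d) array of concatenated video and text features.
    query:    (d,) learnable special token (hypothetical parameter).
    Cost is O(N*d) -- linear in sequence length, unlike the
    O(N^2*d) of full self-attention over the same sequence.
    """
    d = features.shape[1]
    scores = features @ query / np.sqrt(d)   # (N,) attention logits
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights @ features                # (d,) summary vector

def multi_segment_sample(num_frames, num_segments, rng):
    """Sparsely sample one frame index from each video segment.

    Splits [0, num_frames) into num_segments contiguous segments and
    draws one frame uniformly from each, so the samples cover the
    whole video rather than one local region.
    """
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return np.array([rng.integers(lo, hi)
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

rng = np.random.default_rng(0)
frame_ids = multi_segment_sample(num_frames=30, num_segments=5, rng=rng)
features = rng.standard_normal((10, 4))      # toy fused feature sequence
summary = token_summarize(features, rng.standard_normal(4))
```

Because the summary is a single d-dimensional vector, it can be carried recurrently across segments, which is what makes the encoder lightweight relative to attending over all frames at once.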


Results from the Paper


Task                            | Dataset   | Model | Metric   | Value | Global Rank
Visual Question Answering (VQA) | MSRVTT-QA | LRCE  | Accuracy | 0.42  | #21
Visual Question Answering (VQA) | MSVD-QA   | LRCE  | Accuracy | 0.478 | #25
TGIF-Transition                 | TGIF-QA   | LRCE  | Accuracy | 87.9  | #7
TGIF-Action                     | TGIF-QA   | LRCE  | Accuracy | 84.4  | #7
TGIF-Frame                      | TGIF-QA   | LRCE  | Accuracy | 68.8  | #11
