Zero-Shot Video Question Answer
54 papers with code • 15 benchmarks • 15 datasets
This task presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.
Libraries
Use these libraries to find Zero-Shot Video Question Answer models and implementations
Most implemented papers
Mistral 7B
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.
Flamingo: a Visual Language Model for Few-Shot Learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
VILA: On Pre-training for Visual Language Models
Visual language models (VLMs) rapidly progressed with the recent success of large language models.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, in order to obtain 3D information of the baggage surface as completely as possible.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.