Zero-Shot Video Question Answer

54 papers with code • 15 benchmarks • 15 datasets

This task presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.

Most implemented papers

Mistral 7B

mistralai/mistral-src 10 Oct 2023

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo DeepMind 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

PKU-YuanGroup/Video-LLaVA 16 Nov 2023

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
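The idea of projecting a unified (pre-aligned) visual representation into the language model's feature space can be sketched as a small shared projector module. This is an illustrative sketch, not Video-LLaVA's actual code; the dimensions and the two-layer MLP design are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of "alignment before projection": image and video
# tokens come from a frozen visual encoder already aligned to language,
# and one shared projector maps both into the LLM's embedding space.
class VisualProjector(nn.Module):
    def __init__(self, visual_dim=1024, llm_dim=4096):
        super().__init__()
        # Two-layer MLP projector shared by image and video tokens.
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):
        # visual_tokens: (batch, num_tokens, visual_dim)
        return self.proj(visual_tokens)

projector = VisualProjector()
image_tokens = torch.randn(2, 256, 1024)      # one image: 256 patch tokens
video_tokens = torch.randn(2, 8 * 256, 1024)  # 8 frames of 256 tokens each
print(projector(image_tokens).shape)  # torch.Size([2, 256, 4096])
print(projector(video_tokens).shape)  # torch.Size([2, 2048, 4096])
```

Because the projector is shared, image and video tokens land in the same LLM input space, which is the premise behind training a single unified LVLM.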

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

opengvlab/ask-anything CVPR 2024

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

VILA: On Pre-training for Visual Language Models

mit-han-lab/llm-awq CVPR 2024

Visual language models (VLMs) rapidly progressed with the recent success of large language models.

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

ahjeongseo/MASN-pytorch CVPR 2017

In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.

MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks

wuyuejinxia/prcv2019-mvb-renet 26 Jul 2019

Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, and to obtain 3D information of the baggage surface that is as complete as possible.

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

antoyang/FrozenBiLM 16 Jun 2022

Manual annotation of question and answers for videos, however, is tedious and prohibits scalability.
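The frozen bidirectional LM approach scores candidate answers by how likely each one is at a masked position in a templated prompt. The sketch below is a toy stand-in for that inference step, not FrozenBiLM's real model: the "frozen LM head" is a random linear layer and the vocabulary is invented, purely to show the mechanics.

```python
import torch

# Toy sketch of masked-LM answer scoring: a frozen bidirectional model
# fills a [MASK] slot, and the candidate answer with the highest
# probability at that slot wins. All names/weights here are placeholders.
vocab = {"[MASK]": 0, "yes": 1, "no": 2, "cat": 3, "dog": 4}

torch.manual_seed(0)
frozen_lm_head = torch.nn.Linear(8, len(vocab))  # stub for the frozen LM head
for p in frozen_lm_head.parameters():
    p.requires_grad_(False)  # weights stay frozen; only light adapters would train

def score_answers(mask_hidden_state, candidates):
    # mask_hidden_state: hidden vector at the [MASK] position, produced by
    # the (video-conditioned) frozen LM; candidates: answer tokens to rank.
    logits = frozen_lm_head(mask_hidden_state)
    probs = torch.softmax(logits, dim=-1)
    return max(candidates, key=lambda a: probs[vocab[a]].item())

hidden = torch.randn(8)  # placeholder for the real [MASK] representation
answer = score_answers(hidden, ["yes", "no"])
print(answer)
```

This framing needs no answer-specific training: any candidate in the vocabulary can be ranked, which is what enables the zero-shot setting.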

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

opengvlab/internvideo 6 Dec 2022

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
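Coordinating two complementary representations "in a learnable manner" can be sketched as a fusion module with trainable branch weights. This is a minimal assumed sketch, not InternVideo's actual implementation; the per-branch softmax weighting and dimensions are illustrative choices.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: fuse features from a masked-video-modeling encoder
# and a video-language contrastive encoder with learnable weights, so the
# emphasis between the two objectives can shift during fine-tuning.
class LearnableFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # One learnable logit per branch, softmax-normalized to sum to 1.
        self.branch_logits = nn.Parameter(torch.zeros(2))
        self.proj = nn.Linear(dim, dim)

    def forward(self, masked_feat, contrastive_feat):
        w = torch.softmax(self.branch_logits, dim=0)
        fused = w[0] * masked_feat + w[1] * contrastive_feat
        return self.proj(fused)

fusion = LearnableFusion()
a = torch.randn(4, 768)  # features from the masked-modeling branch
b = torch.randn(4, 768)  # features from the contrastive branch
out = fusion(a, b)
print(out.shape)  # torch.Size([4, 768])
```

Starting the branch logits at zero gives both representations equal weight initially; the balance is then learned per downstream task.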