Zero-Shot Video Question Answer
54 papers with code • 15 benchmarks • 15 datasets
This task presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.
Libraries
Use these libraries to find Zero-Shot Video Question Answer models and implementations
Most implemented papers
Mistral 7B
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.
Flamingo: a Visual Language Model for Few-Shot Learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
VILA: On Pre-training for Visual Language Models
Visual language models (VLMs) rapidly progressed with the recent success of large language models.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, in order to obtain 3D information of the baggage surface as completely as possible.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.