TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	MSRVTT-QA	Omni-VideoAssistant	Accuracy	55.3	#11
Zero-Shot Video Question Answer	MSRVTT-QA	Omni-VideoAssistant	Confidence Score	3.3	#6

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

8 Aug 2023 · Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An

This paper presents OmniDataComposer, an approach to multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities. Its core contribution is a cohesive data structure for processing and merging multimodal inputs, including video, audio, and text. Our algorithm builds on advances in several operations: video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), the Recognize Anything Model (RAM), and object tracking. OmniDataComposer can identify over 6400 categories of objects, substantially broadening the range of visual information it captures. It combines these modalities so that they reinforce one another and allow cross-modal data correction. The final output turns each video input into a detailed sequential document, effectively converting videos into thorough narratives that are easier for large language models to process. Future work includes optimizing datasets for each modality to support unlimited data generation. This foundation is intended to give models such as ChatGPT richer input, enabling them to create higher-quality datasets for video captioning and easing question answering over video content. OmniDataComposer opens a new stage in multimodal learning, with considerable potential for improving AI's understanding and generation of complex, real-world data.
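The idea of converting a video into a sequential document can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Segment` and `VideoDocument` classes, their field names, and the output format are all hypothetical, and the per-modality texts are hand-written placeholders rather than real model outputs. The sketch only shows how timestamped results from separate extractors (captioning, ASR, OCR) might be merged into one time-ordered text that a language model could read.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One timestamped observation from a single modality (hypothetical schema)."""
    start: float      # seconds from the beginning of the video
    end: float
    modality: str     # e.g. "caption", "asr", "ocr"
    text: str


@dataclass
class VideoDocument:
    """Collects per-modality segments and renders them as one sequential document."""
    video_id: str
    segments: List[Segment] = field(default_factory=list)

    def add(self, start: float, end: float, modality: str, text: str) -> None:
        self.segments.append(Segment(start, end, modality, text))

    def render(self) -> str:
        # Sort by start time, then modality name, so the output is deterministic.
        ordered = sorted(self.segments, key=lambda s: (s.start, s.modality))
        lines = [f"Video: {self.video_id}"]
        for s in ordered:
            lines.append(
                f"[{s.start:06.2f}-{s.end:06.2f}] {s.modality.upper()}: {s.text}"
            )
        return "\n".join(lines)


# Placeholder inputs standing in for real extractor outputs (assumed, not from the paper).
doc = VideoDocument("demo_clip")
doc.add(0.0, 4.0, "caption", "A man stands at a whiteboard.")
doc.add(1.5, 3.8, "asr", "Today we will cover data fusion.")
doc.add(0.0, 4.0, "ocr", "Lecture 1")
document_text = doc.render()
print(document_text)
```

Keeping every observation in one flat, time-sorted list is only one possible design; the cross-modal correction described in the abstract would need an additional step that compares overlapping segments, which this sketch does not attempt.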

Code

shajiayu1/OmniDataComposer official

Tasks

Automatic Speech Recognition

Automatic Speech Recognition (ASR)

Object Tracking

Optical Character Recognition

Optical Character Recognition (OCR)

Question Answering

speech-recognition

Speech Recognition

Video Captioning

Zero-Shot Video Question Answer

Datasets

MSRVTT-QA

Results from the Paper

Ranked #11 on Zero-Shot Video Question Answer on MSRVTT-QA

Methods

BASE
