Videos

UnAV-100

Introduced by Geng et al. in Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. We believe our UnAV-100, with its realistic complexity, can promote the exploration on comprehensive audio-visual video understanding.

Homepage

Benchmarks

Add a new result Link an existing benchmark

Trend	Task	Dataset Variant	Best Model	Paper	Code
	audio-visual event localization	UnAV-100	UnAV

Papers

Paper	Code	Results	Date	Stars

Dataset Loaders

Add Remove

No data loaders found. You can submit your data loader here.

Tasks

audio-visual event localization

Similar Datasets

MUSIC-AVQA

VideoInstruct

InternVid

ACAV100M

Usage

UnAV-100

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Similar Datasets

MUSIC-AVQA

VideoInstruct

InternVid

ACAV100M

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages