ViTT (Video Timeline Tags)

Introduced by Huang et al. in Multimodal Pretraining for Dense Video Captioning

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the remaining videos have been annotated twice or more, for a total of 12,461 released sets of annotations. The videos in the dataset are drawn from the YouTube-8M dataset.

An annotation has the following format:

{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}
Source: Video Timeline Tags (ViTT)
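Each tag marks the point in the video where a new segment begins, so consecutive timestamps implicitly define segment boundaries. The sketch below shows one way to turn an annotation like the one above into (tag, start, end) segments. It assumes the timestamps are in milliseconds from the start of the video; that unit, and the helper name to_segments, are assumptions for illustration, not part of the release.

```python
import json

# The example annotation from this page (timestamps assumed to be in
# milliseconds from the start of the video -- an assumption, not stated
# in the release).
annotation_json = """
{
  "id": "FmTp",
  "annotations": [
    {"timestamp": 260, "tag": "Opening"},
    {"timestamp": 16000, "tag": "Displaying technique"},
    {"timestamp": 23990, "tag": "Showing foot positioning"},
    {"timestamp": 55530, "tag": "Demonstrating crossover"},
    {"timestamp": 114100, "tag": "Closing"}
  ]
}
"""

def to_segments(annotation):
    """Turn timeline tags into (tag, start, end) segments.

    Each tag starts a segment that ends where the next tag begins.
    The final segment has no known end (None), since the video
    duration is not part of the annotation.
    """
    tags = sorted(annotation["annotations"], key=lambda a: a["timestamp"])
    segments = []
    for current, nxt in zip(tags, tags[1:] + [None]):
        end = nxt["timestamp"] if nxt else None
        segments.append((current["tag"], current["timestamp"], end))
    return segments

annotation = json.loads(annotation_json)
for tag, start, end in to_segments(annotation):
    print(f"{tag}: {start} -> {end}")
```

For the example above this yields five segments, e.g. "Opening" spanning 260 to 16000, with "Closing" left open-ended.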

License


  • Unknown
