14 dataset results for Video Generation

UCF101

The UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips classified into 101 categories. These categories fall into 5 types (Body motion, Human-human interactions, Human-object interactions, Playing musical instruments, and Sports). The total length of the clips is over 27 hours. All videos were collected from YouTube and have a fixed frame rate of 25 FPS at a resolution of 320 × 240.

1,605 PAPERS • 22 BENCHMARKS
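
As a rough sanity check on the UCF101 figures above, the quoted totals imply an average clip length of a little over 7 seconds. The sketch below uses only the numbers in the description. Because the total duration is given as "over 27 hours", the results are lower bounds, and the variable and function names are illustrative.

```python
# Figures quoted in the UCF101 description above.
NUM_CLIPS = 13_320
TOTAL_HOURS = 27  # stated as "over 27 hours", so treat this as a lower bound
FPS = 25


def average_clip_seconds(num_clips: int, total_hours: float) -> float:
    """Mean clip duration in seconds implied by the quoted totals."""
    return total_hours * 3600 / num_clips


avg_seconds = average_clip_seconds(NUM_CLIPS, TOTAL_HOURS)
avg_frames = avg_seconds * FPS  # frames per clip at the fixed frame rate

print(f"~{avg_seconds:.1f} s per clip, ~{avg_frames:.0f} frames per clip")
```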

Kinetics (Kinetics Human Action Video Dataset)

The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.

1,174 PAPERS • 28 BENCHMARKS

MSR-VTT

MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. There are about 29,000 unique words across all captions. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.

519 PAPERS • 7 BENCHMARKS
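
The MSR-VTT numbers above can be cross-checked with a few lines of arithmetic. This sketch assumes every clip has exactly 20 captions, as the description states. The dictionary and function names are illustrative and not part of any official tooling.

```python
# Split sizes (in clips) quoted in the MSR-VTT description above.
SPLITS = {"train": 6_513, "val": 497, "test": 2_990}
CAPTIONS_PER_CLIP = 20  # assumption taken from the description


def captions_per_split(splits: dict[str, int], per_clip: int) -> dict[str, int]:
    """Number of caption sentences in each split."""
    return {name: clips * per_clip for name, clips in splits.items()}


total_clips = sum(SPLITS.values())
caption_counts = captions_per_split(SPLITS, CAPTIONS_PER_CLIP)
total_captions = sum(caption_counts.values())

print(total_clips)      # the three splits together should cover all 10,000 clips
print(caption_counts)
print(total_captions)
```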

WebVid

WebVid contains 10 million video clips with captions, sourced from the web. The videos are diverse and rich in their content.

171 PAPERS • 1 BENCHMARK

Kinetics-600

Kinetics-600 is a large-scale action recognition dataset consisting of around 480K videos from 600 action categories. The 480K videos are divided into 390K, 30K, and 60K for the training, validation, and test sets, respectively. Each video in the dataset is a 10-second clip of an action moment annotated from a raw YouTube video. It is an extension of the Kinetics-400 dataset.

128 PAPERS • 7 BENCHMARKS

LAION-400M

LAION-400M is a dataset of 400 million CLIP-filtered image-text pairs, together with their CLIP embeddings and kNN indices that allow efficient similarity search.

128 PAPERS • 1 BENCHMARK

How2Sign (A Large-scale Multimodal Dataset for Continuous American Sign Language)

How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset. It consists of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities, including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio, enabling detailed 3D pose estimation.

28 PAPERS • 3 BENCHMARKS

BAIR Robot Pushing

A dataset of 64x64 images of a robot pushing objects on a tabletop, from Berkeley AI Research (BAIR).

26 PAPERS • 2 BENCHMARKS

InternVid

InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.

13 PAPERS • NO BENCHMARKS YET

YouTube Driving

The YouTube Driving Dataset contains a massive amount of real-world driving frames captured under varied conditions, spanning different weather, different regions, and diverse scene types.

7 PAPERS • 1 BENCHMARK

CelebV-Text

CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts describe both static and dynamic attributes precisely.

4 PAPERS • NO BENCHMARKS YET

Deep Fakes Dataset (inamibora)

The Deep Fakes Dataset is a collection of "in the wild" portrait videos for deepfake detection. The videos are diverse real-world samples in terms of source generative model, resolution, compression, illumination, aspect ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context. They originate from various sources such as news articles, forums, apps, and research presentations, totalling 142 videos, 32 minutes, and 17 GB. Synthetic videos are matched with their original counterparts when possible.

3 PAPERS • NO BENCHMARKS YET

QST (Quick Sky Time)

QST contains 1,167 video clips cut from 216 time-lapse 4K videos collected from YouTube. It can be used for a variety of tasks, such as (high-resolution) video generation, (high-resolution) video prediction, (high-resolution) image generation, texture generation, image inpainting, image/video super-resolution, image/video colorization, and image/video animation. Each short clip contains between 58 and 1,200 frames (285,446 frames in total), and the resolution of each frame is more than 1,024 x 1,024. QST consists of a training set (1,000 clips, 244,930 frames in total), a validation set (100 clips, 23,200 frames in total), and a testing set (67 clips, 17,316 frames in total). The download key for the QST dataset is qst1.

2 PAPERS • NO BENCHMARKS YET
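
The split sizes quoted in the QST description can be checked against the stated totals (1,167 clips and 285,446 frames). The sketch below uses only numbers from the description. The dictionary layout and the `totals` helper are illustrative and not part of any QST tooling.

```python
# Split sizes quoted in the QST description: name -> (clips, frames).
QST_SPLITS = {
    "train": (1_000, 244_930),
    "val": (100, 23_200),
    "test": (67, 17_316),
}


def totals(splits: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum clip and frame counts over all splits."""
    clips = sum(c for c, _ in splits.values())
    frames = sum(f for _, f in splits.values())
    return clips, frames


total_clips, total_frames = totals(QST_SPLITS)
print(total_clips, total_frames)  # 1167 285446
```

Both sums agree with the totals given in the description, so the three splits partition the full set of clips.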

TLFM dataset (TLFM dataset for microscopy image sequence generation)

The TLFM dataset is structured in sequences of at least nine timesteps. It includes 9,696 images in both brightfield (BF) and green fluorescent protein (GFP) channels at a resolution of 256 × 256, and is intended for multi-domain (BF and GFP) microscopy image sequence generation.

1 PAPER • 1 BENCHMARK
