UCF101 dataset is an extension of UCF50 and consists of 13,320 video clips, which are classified into 101 categories. These 101 categories can be classified into 5 types (Body motion, Human-human interactions, Human-object interactions, Playing musical instruments and Sports). The total length of these video clips is over 27 hours. All the videos are collected from YouTube and have a fixed frame rate of 25 FPS with the resolution of 320 × 240.
831 PAPERS • 10 BENCHMARKS
The HMDB51 dataset is a large collection of realistic videos from various sources, including movies and web videos. The dataset is composed of 6,849 video clips from 51 action categories (such as “jump”, “kiss” and “laugh”), with each category containing at least 101 clips. The original evaluation scheme uses three different training/testing splits. In each split, each action class has 70 clips for training and 30 clips for testing. The average accuracy over these three splits is used to measure the final performance.
399 PAPERS • 11 BENCHMARKS
The Human3.6M dataset is one of the largest motion capture datasets, which consists of 3.6 million human poses and corresponding images captured by a high-speed motion capture system. There are 4 high-resolution progressive scan cameras to acquire video data at 50 Hz. The dataset contains activities by 11 professional actors in 17 scenarios: discussion, smoking, taking photo, talking on the phone, etc., as well as provides accurate 3D joint positions and high-resolution videos.
300 PAPERS • 9 BENCHMARKS
Object Tracking Benchmark (OTB) is a visual tracking benchmark that is widely used to evaluate the performance of a visual tracking algorithm. The dataset contains a total of 100 sequences and each is annotated frame-by-frame with bounding boxes and 11 challenge attributes. OTB-2013 dataset contains 51 sequences and the OTB-2015 dataset contains all 100 sequences of the OTB dataset.
285 PAPERS • 3 BENCHMARKS
The Densely Annotation Video Segmentation dataset (DAVIS) is a high quality and high resolution densely annotated video segmentation dataset under two resolutions, 480p and 1080p. There are 50 video sequences with 3455 densely annotated frames in pixel level. 30 videos with 2079 frames are for training and 20 videos with 1376 frames are for validation.
281 PAPERS • 8 BENCHMARKS
The ActivityNet dataset contains 200 different types of activities and a total of 849 hours of videos collected from YouTube. ActivityNet is the largest benchmark for temporal activity detection to date in terms of both the number of activity categories and number of videos, making the task particularly challenging. Version 1.3 of the dataset contains 19994 untrimmed videos in total and is divided into three disjoint subsets, training, validation, and testing by a ratio of 2:1:1. On average, each activity category has 137 untrimmed videos. Each video on average has 1.41 activities which are annotated with temporal boundaries. The ground-truth annotations of test videos are not public.
267 PAPERS • 8 BENCHMARKS
NTU RGB+D is a large-scale dataset for RGB-D human action recognition. It involves 56,880 samples of 60 action classes collected from 40 subjects. The actions can be generally divided into three categories: 40 daily actions (e.g., drinking, eating, reading), nine health-related actions (e.g., sneezing, staggering, falling down), and 11 mutual actions (e.g., punching, kicking, hugging). These actions take place under 17 different scene conditions corresponding to 17 video sequences (i.e., S001–S017). The actions were captured using three cameras with different horizontal imaging viewpoints, namely, −45∘,0∘, and +45∘. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint position, RGB frames, and infrared sequences. The performance evaluation is performed by a cross-subject test that split the 40 subjects into training and test groups, and by a cross-view test that employed one camera (+45∘) for testing, and the other two cameras for
197 PAPERS • 13 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoors activities with an average length of 30 seconds, involving interactions with 46 objects classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs leading to 157 action classes. Each video in this dataset is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacting objects. 267 different users were presented with a sentence, which includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are7,986 training video and 1,863 validation video.
188 PAPERS • 4 BENCHMARKS
Multimodal Emotion Recognition IEMOCAP The IEMOCAP dataset consists of 151 videos of recorded dialogues, with 2 speakers per session for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance. The dataset is recorded across 5 sessions with 5 pairs of speakers.
178 PAPERS • 3 BENCHMARKS
The efforts to create a non-trivial and publicly available dataset for action recognition was initiated at the KTH Royal Institute of Technology in 2004. The KTH dataset is one of the most standard datasets, which contains six actions: walk, jog, run, box, hand-wave, and hand clap. To account for performance nuance, each action is performed by 25 different individuals, and the setting is systematically altered for each action per actor. Setting variations include: outdoor (s1), outdoor with scale variation (s2), outdoor with different clothes (s3), and indoor (s4). These variations test the ability of each algorithm to identify actions independent of the background, appearance of the actors, and the scale of the actors.
171 PAPERS • 1 BENCHMARK
The THUMOS14 dataset is a large-scale video dataset that includes 1,010 videos for validation and 1,574 videos for testing from 20 classes. Among all the videos, there are 220 and 212 videos with temporal annotations in validation and testing set, respectively.
163 PAPERS • 8 BENCHMARKS
The Extended Cohn-Kanade (CK+) dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to 50 years of age with a variety of genders and heritage. Each video shows a facial shift from the neutral expression to a targeted peak expression, recorded at 30 frames per second (FPS) with a resolution of either 640x490 or 640x480 pixels. Out of these videos, 327 are labelled with one of seven expression classes: anger, contempt, disgust, fear, happiness, sadness, and surprise. The CK+ database is widely regarded as the most extensively used laboratory-controlled facial expression classification database available, and is used in the majority of facial expression classification methods.
151 PAPERS • 2 BENCHMARKS
JHMDB is an action recognition dataset that consists of 960 video sequences belonging to 21 actions. It is a subset of the larger HMDB51 dataset collected from digitized movies and YouTube videos. The dataset contains video and annotation for puppet flow per frame (approximated optimal flow on the person), puppet mask per frame, joint positions per frame, action label per clip and meta label per clip (camera motion, visible body parts, camera viewpoint, number of people, video quality).
146 PAPERS • 6 BENCHMARKS
OTB-2015, also referred as Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for evaluating visual tracking. Image Source: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
144 PAPERS • 1 BENCHMARK
CamVid (Cambridge-driving Labeled Video Database) is a road/driving scene understanding database which was originally captured as five video sequences with a 960×720 resolution camera mounted on the dashboard of a car. Those sequences were sampled (four of them at 1 fps and one at 15 fps) adding up to 701 frames. Those stills were manually annotated with 32 classes: void, building, wall, tree, vegetation, fence, sidewalk, parking block, column/pole, traffic cone, bridge, sign, miscellaneous text, traffic light, sky, tunnel, archway, road, road shoulder, lane markings (driving), lane markings (non-driving), animal, pedestrian, child, cart luggage, bicyclist, motorcycle, car, SUV/pickup/truck, truck/bus, train, and other moving object
139 PAPERS • 3 BENCHMARKS
DAVIS16 is a dataset for video object segmentation which consists of 50 videos in total (30 videos for training and 20 for testing). Per-frame pixel-wise annotations are offered.
139 PAPERS • 5 BENCHMARKS
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for the open domain video captioning, which consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turks. There are about 29,000 unique words in all captions. The standard splits uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
130 PAPERS • 4 BENCHMARKS
The Sports-1M dataset consists of over a million videos from YouTube. The videos in the dataset can be obtained through the YouTube URL specified by the authors. Approximately 7% (as of 2016) of the videos have been removed by the YouTube uploaders since the dataset was compiled. However, there are still over a million videos in the dataset with 487 sports-related categories with 1,000 to 3,000 videos per category. The videos are automatically labelled with 487 sports classes using the YouTube Topics API by analyzing the text metadata associated with the videos (e.g. tags, descriptions). Approximately 5% of the videos are annotated with more than one class.
129 PAPERS • 1 BENCHMARK
MARS (Motion Analysis and Re-identification Set) is a large scale video based person reidentification dataset, an extension of the Market-1501 dataset. It has been collected from six near-synchronized cameras. It consists of 1,261 different pedestrians, who are captured by at least 2 cameras. The variations in poses, colors and illuminations of pedestrians, as well as the poor image quality, make it very difficult to yield high matching accuracy. Moreover, the dataset contains 3,248 distractors in order to make it more realistic. Deformable Part Model and GMMCP tracker were used to automatically generate the tracklets (mostly 25-50 frames long).
120 PAPERS • 1 BENCHMARK
VoxCeleb2 is a large scale speaker recognition dataset obtained automatically from open-source media. VoxCeleb2 consists of over a million utterances from over 6k speakers. Since the dataset is collected ‘in the wild’, the speech segments are corrupted with real world noise including laughter, cross-talk, channel effects, music and other sounds. The dataset is also multilingual, with speech from speakers of 145 different nationalities, covering a wide range of accents, ages, ethnicities and languages. The dataset is audio-visual, so is also useful for a number of other applications, for example – visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa and training face recognition from video to complement existing face recognition datasets.
119 PAPERS • 3 BENCHMARKS
The Moving MNIST dataset contains 10,000 video sequences, each consisting of 20 frames. In each video sequence, two digits move independently around the frame, which has a spatial resolution of 64×64 pixels. The digits frequently intersect with each other and bounce off the edges of the frame
112 PAPERS • 1 BENCHMARK
MPII Human Pose Dataset is a dataset for human pose estimation. It consists of around 25k images extracted from online videos. Each image contains one or more people, with over 40k people annotated in total. Among the 40k samples, ∼28k samples are for training and the remainder are for testing. Overall the dataset covers 410 human activities and each image is provided with an activity label. Images were extracted from a YouTube video and provided with preceding and following un-annotated frames.
109 PAPERS • 2 BENCHMARKS
DAVIS17 is a dataset for video object segmentation. It contains a total of 150 videos - 60 for training, 30 for validation, 60 for testing
105 PAPERS • 7 BENCHMARKS
OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.
104 PAPERS • 1 BENCHMARK
The Microsoft Research Video Description Corpus (MSVD) dataset consists of about 120K sentences collected during the summer of 2010. Workers on Mechanical Turk were paid to watch a short video snippet and then summarize the action in a single sentence. The result is a set of roughly parallel descriptions of more than 2,000 video snippets. Because the workers were urged to complete the task in the language of their choice, both paraphrase and bilingual alternations are captured in the data.
101 PAPERS • 3 BENCHMARKS
VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps with pixel-wise annotation of salient objects.
95 PAPERS • 1 BENCHMARK
The MOTChallenge datasets are designed for the task of multiple object tracking. There are several variants of the dataset released each year, such as MOT15, MOT17, MOT20.
94 PAPERS • 6 BENCHMARKS
MSMT17 is a multi-scene multi-time person re-identification dataset. The dataset consists of 180 hours of videos, captured by 12 outdoor cameras, 3 indoor cameras, and during 12 time slots. The videos cover a long period of time and present complex lighting variations, and it contains a large number of annotated identities, i.e., 4,101 identities and 126,441 bounding boxes.
94 PAPERS • 7 BENCHMARKS
FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. The data has been sourced from 977 youtube videos and all videos contain a trackable mostly frontal face without occlusions which enables automated tampering methods to generate realistic forgeries.
92 PAPERS • 1 BENCHMARK
The UCSD Anomaly Detection Dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. The crowd density in the walkways was variable, ranging from sparse to very crowded. In the normal setting, the video contains only pedestrians. Abnormal events are due to either: the circulation of non pedestrian entities in the walkways anomalous pedestrian motion patterns Commonly occurring anomalies include bikers, skaters, small carts, and people walking across a walkway or in the grass that surrounds it. A few instances of people in wheelchair were also recorded. All abnormalities are naturally occurring, i.e. they were not staged for the purposes of assembling the dataset. The data was split into 2 subsets, each corresponding to a different scene. The video footage recorded from each scene was split into various clips of around 200 frames.
92 PAPERS • 2 BENCHMARKS
MPI-INF-3DHP is a 3D human body pose estimation dataset consisting of both constrained indoor and complex outdoor scenes. It records 8 actors performing 8 activities from 14 camera views. It consists on >1.3M frames captured from the 14 cameras.
90 PAPERS • 2 BENCHMARKS
The MOT16 dataset is a dataset for multiple object tracking. It a collection of existing and new data (part of the sources are from and ), containing 14 challenging real-world videos of both static scenes and moving scenes, 7 for training and 7 for testing. It is a large-scale dataset, composed of totally 110407 bounding boxes in training set and 182326 bounding boxes in test set. All video sequences are annotated under strict standards, their ground-truths are highly accurate, making the evaluation meaningful.
89 PAPERS • 2 BENCHMARKS
The YouTube-8M dataset is a large scale video dataset, which includes more than 7 million videos with 4716 classes labeled by the annotation system. The dataset consists of three parts: training set, validate set, and test set. In the training set, each class contains at least 100 training videos. Features of these videos are extracted by the state-of-the-art popular pre-trained models and released for public use. Each video contains audio and visual modality. Based on the visual information, videos are divided into 24 topics, such as sports, game, arts & entertainment, etc
89 PAPERS • 2 BENCHMARKS
LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmark. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild where target objects may disappear and re-appear again in the view.
88 PAPERS • 1 BENCHMARK
The GoPro dataset for deblurring consists of 3,214 blurred images with the size of 1,280×720 that are divided into 2,103 training images and 1,111 test images. The dataset consists of pairs of a realistic blurry image and the corresponding ground truth shapr image that are obtained by a high-speed camera.
86 PAPERS • 1 BENCHMARK
VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets.
85 PAPERS • 1 BENCHMARK
The Hopkins 155 dataset consists of 156 video sequences of two or three motions. Each video sequence motion corresponds to a low-dimensional subspace. There are 39−550 data vectors drawn from two or three motions for each video sequence.
84 PAPERS • 1 BENCHMARK
The Multiple Object Tracking 17 (MOT17) dataset is a dataset for multiple object tracking. Similar to its previous version MOT16, this challenge contains seven different indoor and outdoor scenes of public places with pedestrians as the objects of interest. A video for each scene is divided into two clips, one for training and the other for testing. The dataset provides detections of objects in the video frames with three detectors, namely SDP, Faster-RCNN and DPM. The challenge accepts both on-line and off-line tracking approaches, where the latter are allowed to use the future video frames to predict tracks.
83 PAPERS • 2 BENCHMARKS
The IJB-C dataset is a video-based face recognition dataset. It is an extension of the IJB-A dataset with about 138,000 face images, 11,000 face videos, and 10,000 non-face images.
78 PAPERS • 1 BENCHMARK
The Denver Intensity of Spontaneous Facial Action (DISFA) dataset consists of 27 videos of 4844 frames each, with 130,788 images in total. Action unit annotations are on different levels of intensity, which are ignored in the following experiments and action units are either set or unset. DISFA was selected from a wider range of databases popular in the field of facial expression recognition because of the high number of smiles, i.e. action unit 12. In detail, 30,792 have this action unit set, 82,176 images have some action unit(s) set and 48,612 images have no action unit(s) set at all.
75 PAPERS • 1 BENCHMARK
SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video.
75 PAPERS • 3 BENCHMARKS
The 3D Poses in the Wild dataset is the first dataset in the wild with accurate 3D poses for evaluation. While other datasets outdoors exist, they are all restricted to a small recording volume. 3DPW is the first one that includes video footage taken from a moving phone camera.
74 PAPERS • 1 BENCHMARK
SURREAL (Synthetic hUmans foR REAL tasks) is a large-scale person dataset that generates photorealistic synthetic images with labeling for human part segmentation and depth estimation, producing 6.5M frames in 67.5K short clips (about 100 frames each) of 2.6K action sequences with 145 different synthetic subjects. To ensure realism, the synthetic bodies are created using the SMPL body model, whose parameters are fit by the MoSh method given raw 3D MoCap marker data.
74 PAPERS • 1 BENCHMARK
The Lip Reading in the Wild (LRW) dataset a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, whose boundary is centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class while the validation and test sets contain 50 utterances.
73 PAPERS • 5 BENCHMARKS
Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
73 PAPERS • 1 BENCHMARK
The UCY dataset consist of real pedestrian trajectories with rich multi-human interaction scenarios captured at 2.5 Hz (Δt=0.4s). It is composed of three sequences (Zara01, Zara02, and UCY), taken in public spaces from top-view.
70 PAPERS • NO BENCHMARKS YET
SUN3D contains a large-scale RGB-D video database, with 8 annotated sequences. Each frame has a semantic segmentation of the objects in the scene and information about the camera pose. It is composed by 415 sequences captured in 254 different spaces, in 41 different buildings. Moreover, some places have been captured multiple times at different moments of the day.
69 PAPERS • NO BENCHMARKS YET