Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. employs a combination of human-assisted and semi-automatic annotation techniques, aiming to produce high-quality video instruction data. These methods create question-answer pairs related to
16 PAPERS • 6 BENCHMARKS
ACID consists of thousands of aerial drone videos of different coastline and nature scenes on YouTube. Structure-from-motion is used to get camera poses.
15 PAPERS • 2 BENCHMARKS
AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance Motion Dataset is constructed from the AIST Dance Video DB. With multi-view videos, an elaborate pipeline is designed to estimate the camera parameters, 3D human keypoints and 3D human dance motion sequences:
ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluate how grounded or "true" such model are to the video they describe.
15 PAPERS • NO BENCHMARKS YET
CASIA-B is a large multiview gait database, which is created in January 2005. There are 124 subjects, and the gait data was captured from 11 views. Three variations, namely view angle, clothing and carrying condition changes, are separately considered. Besides the video files, we still provide human silhouettes extracted from video files. The detailed information about Dataset B and an evaluation framework can be found in this paper .
15 PAPERS • 1 BENCHMARK
We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and personal characteristic. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts and other knowledge.
15 PAPERS • 3 BENCHMARKS
Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions.
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR) -motivated multi-sensor egocentric world view. The dataset contains AR glasses egocentric multi-channel microphone array audio, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
15 PAPERS • 4 BENCHMARKS
The FIVR-200K dataset has been collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). The dataset comprises 225,960 videos associated with 4,687 Wikipedia events and 100 selected video queries.
FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges for temporal action localization, thanks to its distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes.
TV show Caption is a large-scale multimodal captioning dataset, containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique as its captions may also describe dialogues/subtitles while the captions in the other datasets are only describing the visual content.
V2X-Sim, short for vehicle-to-everything simulation, is the a synthetic collaborative perception dataset in autonomous driving developed by AI4CE Lab at NYU and MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both roadside and vehicles when they are presented near the same intersection. With information from both the roadside infrastructure and vehicles, the dataset aims to encourage research on collaborative perception tasks.
VIPL-HR database is a database for remote heart rate (HR) estimation from face videos under less-constrained situations. It contains 2,378 visible light videos (VIS) and 752 near-infrared (NIR) videos of 107 subjects. Nine different conditions, including various head movements and illumination conditions are taken into consideration. All the videos are recorded using Logitech C310, RealSense F200 and the front camera of HUAWEI P9 smartphone, and the ground-truth HR is recorded using a CONTEC CMS60C BVP sensor (a FDA approved device).
4DFAB is a large scale database of dynamic high-resolution 3D faces which consists of recordings of 180 subjects captured in four different sessions spanning over a five-year period (2012 - 2017), resulting in a total of over 1,800,000 3D meshes. It contains 4D videos of subjects displaying both spontaneous and posed facial behaviours. The database can be used for both face and facial expression recognition, as well as behavioural biometrics. It can also be used to learn very powerful blendshapes for parametrising facial behaviour.
14 PAPERS • NO BENCHMARKS YET
The Audio Visual Scene-Aware Dialog (AVSD) dataset, or DSTC7 Track 3, is a audio-visual dataset for dialogue understanding. The goal with the dataset and track was to design systems to generate responses in a dialog about a video, given the dialog history and audio-visual content of the video.
14 PAPERS • 1 BENCHMARK
Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.
14 PAPERS • 2 BENCHMARKS
DAiSEE is a multi-label video classification dataset comprising of 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration "in the wild". The dataset has four levels of labels namely - very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists.
EMDB contains in-the-wild videos of human activity recorded with a hand-held iPhone. It features reference SMPL body pose and shape parameters, as well as global body root and camera trajectories. The reference 3D poses were obtained by jointly fitting SMPL to 12 body-worn electromagnetic sensors and image data. For the latter we fit a neural implicit avatar model to allow for a dense pixel-wise fitting objective.
Extreme Pose Interaction (ExPI) Dataset is a new person interaction dataset of Lindy Hop dancing actions. In Lindy Hop, the two dancers are called leader and follower. The authors recorded two couples of dancers in a multi-camera setup equipped also with a motion-capture system. 16 different actions are performed in ExPI dataset, some by the two couples of dancers, some by only one of the couples. Each action was repeated five times to account for variability. More precisely, for each recorded sequence, ExPI provides: (i) Multi-view videos at 25FPS from all the cameras in the recording setup; (ii) Mocap data (3D position of 18 joints for each person) at 25FPS synchronized with the videos.; (iii) camera calibration information; and (iv) 3D shapes as textured meshes for each frame.
Fisheye dataset comprises of synthetically generated fisheye sequences and fisheye video sequences captured with an actual fisheye camera designed for fisheye motion estimation.
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodAL understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. The MECCANO dataset has been acquired in an industrial-like scenario in which subjects built a toy model of a motorbike. We considered 20 object classes which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instructions booklet and a partial_model class.
14 PAPERS • 3 BENCHMARKS
Memorability dataset with 10000 3-second videos. Each video has upwards of 90 human annotations, and the split-half consistency of this dataset is 0.73 (best in class for video memorabilty datasets).
The Oxford Radar RobotCar Dataset is a radar extension to The Oxford RobotCar Dataset. It has been extended with data from a Navtech CTS350-X Millimetre-Wave FMCW radar and Dual Velodyne HDL-32E LIDARs with optimised ground truth radar odometry for 280 km of driving around Oxford, UK (in addition to all sensors in the original Oxford RobotCar Dataset).
How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions in four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations. (Data Download, STAR Leaderboard)
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations. It allows researchers to study the interaction between modalities or use independent unimodal annotations for unimodal sentiment analysis.
13 PAPERS • 1 BENCHMARK
CelebV-HQ is a large-scale video facial attributes dataset with annotations. CelebV-HQ contains 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering appearance, action, and emotion.
13 PAPERS • 3 BENCHMARKS
CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset. It is the first public release of part of CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form of <instrument, verb, target>.
13 PAPERS • 2 BENCHMARKS
DeepStab is a dataset for online video stabilization consisting of synchronized steady/unsteady video pairs collected via a well designed hand-held hardware.
13 PAPERS • NO BENCHMARKS YET
Everybody Dance Now is a dataset of videos that can be used for training and motion transfer. It contains long single-dancer videos that can be used to train and evaluate the model. All subjects have consented to allowing the data to be used for research purposes.
First-Person Hand Action Benchmark is a collection of RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations.
A large-scale 4D egocentric dataset with rich annotations, to catalyze the research of category-level human-object interaction. HOI4D consists of 2.4M RGB-D egOCentric video frames over 4000 sequences collected by 4 participants interacting with 800 different object instances from 16 categories over 610 different indoor rooms.
No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem to social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC). Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39, 000 real-world distorted videos and 117, 000 space-time localized video patches ("v-patches"), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Ma
LIVE-YT-HFR comprises of 480 videos having 6 different frame rates, obtained from 16 diverse contents.
MAFW is a large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. It contains 10,045 video-audio clips, annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of the 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment.
MGif is a dataset of videos containing movements of different cartoon animals. Each video is a moving gif file. The dataset consists of 1000 videos. The dataset is particularly challenging because of the high appearance variation and motion diversity.
Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in aspects of small numbers of instances in a trimmed video or low-level atomic actions. This paper aims to present a new multi-person dataset of spatio-temporal localized sports actions, coined as MultiSports. We first analyze the important ingredients of constructing a realistic and challenging dataset for spatio-temporal action detection by proposing three criteria: (1) multi-person scenes and motion dependent identification, (2) with well-defined boundaries, (3) relatively fine-grained classes of high complexity. Based on these guidelines, we build the dataset of MultiSports v1.0 by selecting 4 sports classes, collecting 3200 video clips, and annotating 37701 action instances with 902k bounding boxes. Our dataset is characterized with important properties of high diversity, dense annotation, and high quality. Our MultiSports, with its
SQA3D is a dataset for embodied scene understanding, where an agent needs to comprehend the scene it situates from an first person's perspective and answer questions. The questions are designed to be situated, embodied and knowledge-intensive. We offer three different modalities to represent a 3D scene: 3D scan, egocentric video and BEV picture.
The Robotic Pushing Dataset is a dataset for video prediction for real-world interactive agents which consists of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action.
12 PAPERS • NO BENCHMARKS YET
The SUN-SEG dataset is a high-quality per-frame annotated VPS dataset, which includes 158,690 frames from the famous SUN dataset. It extends the labels with diverse types, i.e., object mask, boundary, scribble, polygon, and visual attribute. It also introduces the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.
12 PAPERS • 1 BENCHMARK
V2V4Real is a large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi modal sensors driving together through diverse scenarios. It covers a driving area of 410 km comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps that cover all the driving routes.
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool of sequences using a methodology based on clustering visual features of object and background so that those 25 sequences sample evenly well the existing pool.
VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic-roles every 2 seconds. Entities are co-referenced across events within a movie clip and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) as well as diverse (∼200 verbs have more than 100 annotations each).
The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos. The videos are organized in a similar format as DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768*512. There are 3-5 objects per video, and each object has a random smooth trajectory -- we tried to optimize the trajectories in a greedy fashion to minimize object intersection (not guaranteed), with occlusions still possible (happen a lot in reality). See MiVOS for details.
11 PAPERS • NO BENCHMARKS YET
BOBSL is a large-scale dataset of British Sign Language (BSL). It comprises 1,962 episodes (approximately 1,400 hours) of BSL-interpreted BBC broadcast footage accompanied by written English subtitles. From horror, period and medical dramas, history, nature and science documentaries, sitcoms, children’s shows and programs covering cooking, beauty, business and travel, BOBSL covers a wide range of topics. The dataset features a total of 39 signers. Distinct signers appear in the training, validation and test sets for signer-independent evaluation.
The dataset contains 21 full-HD videos, each around 1 hr long, captured at six different locations. Vehicles in the videos (20 865 instances in total) are annotated with the precise speed measurements from optical gates using LiDAR and verified with several reference GPS tracks. The dataset is available for download and it contains the videos and metadata (calibration, lengths of features in image, annotations, and so on) for future comparison and evaluation.
11 PAPERS • 1 BENCHMARK