The Sims4Action Dataset: a videogame-based dataset for Synthetic→Real domain adaptation for human activity recognition.
5 PAPERS • NO BENCHMARKS YET
The SynthHands dataset is a dataset for hand pose estimation which consists of real captured hand motion retargeted to a virtual hand with natural backgrounds and interactions with different objects. The dataset contains data for male and female hands, both with and without interaction with objects. While the hand and foreground object are synthetically generated using Unity, the motion was obtained from real performances as described in the accompanying paper. In addition, real object textures and background images (depth and color) were used. Ground truth 3D positions are provided for 21 keypoints of the hand.
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016 through TRECVID-AVS2018 contains 335,944 web video clips (600 hours).
5 PAPERS • 1 BENCHMARK
TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, composed of videos, questions, and answer spans manually collected from screencast tutorial videos with spoken narratives for photo-editing software.
The UCLA Aerial Event Dataset was captured by a low-cost hex-rotor with a GoPro camera, which is able to eliminate the high-frequency vibration of the camera and hold its position in the air autonomously using GPS and a barometer. It can fly 20 to 90 m above the ground and stay in the air for 5 minutes.
UESTC RGB-D Varying-view action database contains 40 categories of aerobic exercise. We utilized 2 Kinect V2 cameras in 8 fixed directions and 1 round direction to capture these actions with the data modalities of RGB video, 3D skeleton sequences and depth map sequences.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
VCSL (Video Copy Segment Localization) is a new comprehensive segment-level annotated video copy dataset. Compared with existing copy detection datasets restricted by either video-level annotation or small-scale, VCSL not only has two orders of magnitude more segment level labelled data, with 160k realistic video copy pairs containing more than 280k localized copied segment pairs, but also covers a variety of video categories and a wide range of video duration. All the copied segments inside each collected video pair are manually extracted and accompanied by precisely annotated starting and ending timestamps.
VIPER is a benchmark suite for visual perception. The benchmark is based on more than 250K high-resolution video frames, all annotated with ground-truth data for both low-level and high-level vision tasks, including optical flow, semantic instance segmentation, object detection and tracking, object-level 3D scene layout, and visual odometry. Ground-truth data for all tasks is available for every frame. The data was collected while driving, riding, and walking a total of 184 kilometers in diverse ambient conditions in a realistic virtual world.
A large-scale multi-modal dataset to facilitate research and studies that concentrate on vision-wireless systems. The Vi-Fi dataset is a large-scale multi-modal dataset that consists of vision, wireless and smartphone motion sensor data of multiple participants and passer-by pedestrians in both indoor and outdoor scenarios. In Vi-Fi, vision modality includes RGB-D video from a mounted camera. Wireless modality comprises smartphone data from participants including WiFi FTM and IMU measurements.
VideoCube is a high-quality and large-scale benchmark to create a challenging real-world experimental environment for Global Instance Tracking (GIT). MGIT is a high-quality and multi-modal benchmark based on VideoCube-Tiny to fully represent the complex spatio-temporal and causal relationships coupled in longer narrative content.
WWW Crowd provides 10,000 videos with over 8 million frames from 8,257 diverse scenes, therefore offering a comprehensive dataset for the area of crowd understanding.
Video object segmentation has been studied extensively in the past decade due to its importance in understanding video spatial-temporal structures as well as its value in industrial applications. Recently, data-driven algorithms (e.g. deep learning) have become the dominant approach to computer vision problems, and one of the most important keys to their success is the availability of large-scale datasets. Previously, we presented the first large-scale video object segmentation dataset, named YouTubeVOS, and hosted the Large-scale Video Object Segmentation Challenge in conjunction with ECCV 2018, ICCV 2019 and CVPR 2021. This year, we are thrilled to invite you to the 4th Large-scale Video Object Segmentation Challenge in conjunction with CVPR 2022. The benchmark will be an augmented version of the YouTubeVOS dataset with more annotations. Some incorrect annotations have also been corrected. For more details, check our website for the workshop and challenge.
iMiGUE is an identity-free video dataset for Micro-Gesture Understanding and Emotion analysis, intended for emotional artificial intelligence research. Unlike existing public datasets, iMiGUE focuses on nonverbal body gestures without using any identity information, whereas most emotion analysis research relies on sensitive biometric data such as face and speech. Most importantly, iMiGUE focuses on micro-gestures, i.e., unintentional behaviors driven by inner feelings, which differ from the gestures in other gesture datasets, which are mostly performed intentionally for illustrative purposes. Furthermore, iMiGUE is designed to evaluate the ability of models to analyze emotional states by integrating information from recognized micro-gestures, rather than just recognizing prototypes in the sequences separately (or in isolation).
iPer is a new dataset, with diverse styles of clothes in videos, for the evaluation of human motion imitation, appearance transfer, and novel view synthesis. There are 30 subjects of different conditions of shape, height, and gender. Each subject wears different clothes and performs an A-pose video and a video with random actions. There are 103 clothes in total. The whole dataset contains 206 video sequences with 241,564 frames.
Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to extensive varieties of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding "maskers" (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO 12913-2:2018.
4 PAPERS • NO BENCHMARKS YET
BS-RSC is a real-world rolling shutter (RS) correction dataset and a corresponding model to correct the RS frames in a distorted video. Real distorted videos with corresponding ground truth are recorded simultaneously via a well-designed beam-splitter-based acquisition system. BS-RSC contains various motions of both camera and objects in dynamic scenes.
Video sequences from a glasshouse environment in Campus Kleinaltendorf (CKA), University of Bonn, captured by PATHoBot, a glasshouse monitoring robot.
The Bimanual Actions Dataset is a collection of 540 RGB-D videos showing subjects performing bimanual actions in a kitchen or workshop context. The main purpose of its compilation is to research bimanual human behaviour in order to eventually improve the capabilities of humanoid robots.
The nine (moving camera) videos in this benchmark exhibit camouflaged animals that are difficult to see in a single frame, but can be detected based upon their motion across frames.
4 PAPERS • 1 BENCHMARK
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts describe both static and dynamic attributes precisely.
ChangeSim is a dataset aimed at online scene change detection (SCD) and more. The data is collected in photo-realistic simulation environments with the presence of environmental non-targeted variations, such as air turbidity and light condition changes, as well as targeted object changes in industrial indoor environments. By collecting data in simulations, multi-modal sensor data and precise ground truth labels are obtainable such as the RGB image, depth image, semantic segmentation, change segmentation, camera poses, and 3D reconstructions. While the previous online SCD datasets evaluate models given well-aligned image pairs, ChangeSim also provides raw unpaired sequences that present an opportunity to develop an online SCD model in an end-to-end manner, considering both pairing and detection. Experiments show that even the latest pair-based SCD models suffer from the bottleneck of the pairing process, and it gets worse when the environment contains the non-targeted variations.
4 PAPERS • 2 BENCHMARKS
The TUB CrowdFlow is a synthetic dataset that contains 10 sequences showing 5 scenes. Each scene is rendered twice: with a static point of view and with a dynamic camera to simulate drone/UAV-based surveillance. The scenes are rendered using Unreal Engine at HD resolution (1280x720) at 25 fps, which is typical for current commercial CCTV surveillance systems. The total number of frames is 3200.
DarkTrack2021 is a challenging nighttime UAV tracking benchmark, which contains 110 challenging sequences with over 100 K frames in total.
The Dataset of Multimodal Semantic Egocentric Video (DoMSEV) contains 80 hours of multimodal (RGB-D, IMU, and GPS) data related to First-Person Videos with annotations for recorder profile, frame scene, activities, interaction, and attention.
DroneCrowd is a benchmark for object detection, tracking and counting algorithms in drone-captured videos. It is a drone-captured large scale dataset formed by 112 video clips with 33,600 HD frames in various scenarios. Notably, it has annotations for 20,800 people trajectories with 4.8 million heads and several video-level attributes.
The dataset contains 7000 videos: native, altered and exchanged through social platforms. The altered content includes manipulations with FFmpeg, AVIdemux, Kdenlive and Adobe Premiere. The social platforms used to exchange the native and altered videos are Facebook, TikTok, YouTube and Weibo.
GolfDB is a high-quality video dataset created for general recognition applications in the sport of golf, and specifically for the task of golf swing sequencing.
HowTo100M Adverbs is a subset of HowTo100M with adverbs mined from 83 tasks in HowTo100M. The annotations were obtained from automatically transcribed narrations of instructional videos. The dataset originally contains 5,824 clips annotated with action-adverb pairs from 72 verbs and 6 adverbs. Source: How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs
This is a dataset for the video deinterlacing problem. The dataset contains 28 video sequences, each 60 frames long. The resolution of all video sequences is 1920x1080. Top-field-first (TFF) interlacing was used to produce the interlaced data from the ground truth (GT).
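As a rough illustration of what top-field-first interlacing means, the sketch below weaves two progressive frames into one interlaced frame: even rows (the top field) come from the earlier frame and odd rows (the bottom field) from the later one. The function name `weave_tff` and the choice to pair consecutive frames are assumptions made for illustration; the exact procedure used to build this dataset is not stated here.

```python
def weave_tff(earlier, later):
    """Weave two progressive frames into one top-field-first interlaced frame.

    Each frame is a list of rows. Even rows (0, 2, 4, ...) form the top field
    and are taken from the earlier frame; odd rows form the bottom field and
    are taken from the later frame. This pairing is an illustrative assumption.
    """
    if len(earlier) != len(later):
        raise ValueError("frames must have the same number of rows")
    return [
        list(earlier[r]) if r % 2 == 0 else list(later[r])
        for r in range(len(earlier))
    ]


# Two tiny 4x2 "frames": every pixel of the first is 1, of the second is 2.
frame_a = [[1, 1] for _ in range(4)]
frame_b = [[2, 2] for _ in range(4)]
interlaced = weave_tff(frame_a, frame_b)
```

A deinterlacing method would then try to recover full-resolution progressive frames from inputs like `interlaced`.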
Understanding what makes a video memorable has a very broad range of current applications, e.g., education and learning, content retrieval and search, content summarization, storytelling, targeted advertising, content recommendation and filtering. This task requires participants to automatically predict memorability scores for videos that reflect the probability for a video to be remembered over both a short and long term. Participants will be provided with an extensive data set of videos with memorability annotations, related information, pre-extracted state-of-the-art visual features, and Electroencephalography (EEG) recordings.
MovingFashion is a dataset for video-to-shop, the task of retrieving clothes which are worn in social media videos. MovingFashion is composed of 14,855 social videos, each one of them associated with e-commerce "shop" images where the corresponding clothing items are clearly portrayed.
For each dataset we provide a short description as well as some characterization metrics. These include the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep) and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels: the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the used partition methods are described in
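The simpler characterization metrics can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: labels are given as a binary matrix with one row per instance, the helper name `characterize` is hypothetical, and the number of possible labelsets is taken to be 2^q (some definitions cap it at the number of instances instead). avgIR and rDep are omitted.

```python
def characterize(label_matrix, num_attributes):
    """Compute basic multi-label characterization metrics.

    label_matrix: list of rows, each a list of 0/1 flags, one per label.
    num_attributes: the number of input attributes d.
    """
    m = len(label_matrix)        # number of instances
    q = len(label_matrix[0])     # number of labels
    d = num_attributes

    # Cardinality: average number of labels per instance.
    card = sum(sum(row) for row in label_matrix) / m
    # Density: cardinality divided by the number of labels.
    dens = card / q
    # Diversity: distinct labelsets present over possible labelsets.
    # Assumption: "possible labelsets" means all 2^q combinations.
    distinct_labelsets = len({tuple(row) for row in label_matrix})
    div = distinct_labelsets / (2 ** q)

    return {
        "m": m, "d": d, "q": q,
        "Card": card, "Dens": dens, "Div": div,
        "complexity": m * q * d,  # m x q x d
    }


# Toy example: 4 instances, 3 labels, 5 attributes.
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
stats = characterize(Y, num_attributes=5)
```

In this toy example the four instances carry 7 labels in total, so the cardinality is 1.75, and 3 of the 8 possible labelsets occur.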
The dataset contains information on what video segments a specific user considers a highlight. Having this kind of data allows for strong personalization models, as specific examples of what a user is interested in help models obtain a fine-grained understanding of that specific user.
Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics – across visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23s average length, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or fine tuning regime.
QUVA Repetition dataset consists of 100 videos displaying a wide variety of repetitive video dynamics, including swimming, stirring, cutting, combing and music-making. All videos have been annotated with individual cycle bounds and a total repetition count.
Collects dense per-video-shot concept annotations.
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model.
Specially designed to evaluate active learning for video object detection in road scenes.
This data collection consists of images acquired during chemoradiotherapy of 20 locally-advanced, non-small cell lung cancer patients. The images include four-dimensional (4D) fan beam (4D-FBCT) and 4D cone beam CT (4D-CBCT). All patients underwent concurrent radiochemotherapy to a total dose of 64.8-70 Gy using daily 1.8 or 2 Gy fractions.
A collection of 2511 recipes for zero-shot learning, recognition and anticipation.
TinyVIRAT contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and they are extracted from surveillance videos which makes them realistic and more challenging.
Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. We believe our UnAV-100, with its realistic complexity, can promote the exploration on comprehensive audio-visual video understanding.
VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB.
VidOR (Video Object Relation) dataset contains 10,000 videos (98.6 hours) from YFCC100M collection together with a large amount of fine-grained annotations for relation understanding. In particular, 80 categories of objects are annotated with bounding-box trajectory to indicate their spatio-temporal location in the videos; and 50 categories of relation predicates are annotated among all pairs of annotated objects with starting and ending frame index. This results in around 50,000 object and 380,000 relation instances annotated. To use the dataset for model development, the dataset is split into 7,000 videos for training, 835 videos for validation, and 2,165 videos for testing.
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue, that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
4 PAPERS • 3 BENCHMARKS