The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. The videos are collected from YouTube.
1,320 PAPERS • 30 BENCHMARKS
The Charades dataset is composed of 9,848 videos of daily indoor activities with an average length of 30 seconds, involving interactions with 46 object classes in 15 types of indoor scenes and containing a vocabulary of 30 verbs, leading to 157 action classes. Each video is annotated with multiple free-text descriptions, action labels, action intervals, and classes of interacting objects. 267 different users were each presented with a sentence composed of objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence. In total, the dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. In the standard split there are 7,986 training videos and 1,863 validation videos.
422 PAPERS • 6 BENCHMARKS
Charades-STA is a dataset built on top of Charades by adding sentence temporal annotations.
231 PAPERS • 4 BENCHMARKS
The ShanghaiTech Campus dataset has 13 scenes with complex lighting conditions and camera angles. It contains 130 abnormal events and over 270,000 training frames. Moreover, both frame-level and pixel-level ground truth for the abnormal events is annotated in this dataset.
204 PAPERS • 8 BENCHMARKS
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (~6× larger than existing benchmarks), spanning 12 evaluation dimensions that cover the comprehension of both the image and video modalities.
122 PAPERS • NO BENCHMARKS YET
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:
112 PAPERS • 7 BENCHMARKS
A novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production.
57 PAPERS • 7 BENCHMARKS
MovieNet is a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g., trailers, photos, plot descriptions, etc. In addition, different aspects of manual annotations are provided in MovieNet, including 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style.
53 PAPERS • 1 BENCHMARK
The EPIC-KITCHENS-55 dataset comprises 432 egocentric videos recorded by 32 participants in their kitchens at 60 fps with a head-mounted camera. There is no guiding script; participants freely perform kitchen activities related to cooking, food preparation, washing up, and so on. Each video is split into short action segments (mean duration 3.7 s) with specific start and end times and a verb and noun annotation describing the action (e.g. ‘open fridge‘). There are 125 verb classes and 331 noun classes. The dataset is divided into one train and two test splits.
41 PAPERS • 3 BENCHMARKS
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
41 PAPERS • NO BENCHMARKS YET
A multitask action quality assessment (AQA) dataset, the largest to date, comprising more than 1,600 diving samples; it contains detailed annotations for fine-grained action recognition, commentary generation, and AQA score estimation. Videos from multiple angles are provided wherever available.
33 PAPERS • 2 BENCHMARKS
Contains 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances.
32 PAPERS • 1 BENCHMARK
VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) and diverse (∼200 verbs have more than 100 annotations each).
18 PAPERS • NO BENCHMARKS YET
Capturing knowledge from surrounding situations and reasoning over it accordingly is crucial and challenging for machine intelligence. The STAR Benchmark is a novel benchmark for Situated Reasoning, which provides 60K challenging situated questions across four types of tasks, 140K situated hypergraphs, symbolic situation programs, and logic-grounded diagnosis for real-world video situations.
17 PAPERS • 2 BENCHMARKS
The Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcams mounted on driving vehicles, which is critical to developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets by its diversified accident annotations, including environmental attributes (day/night; snowy, rainy, or good weather), whether the ego-vehicle is involved, accident participants, and accident reason descriptions.
16 PAPERS • 1 BENCHMARK
HVU is organized hierarchically in a semantic taxonomy that treats multi-label and multi-task video understanding as a comprehensive problem encompassing the recognition of multiple semantic aspects in a dynamic scene. HVU contains approximately 572K videos in total, with 9 million annotations across the training, validation, and test sets spanning 3,142 labels. HVU covers semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts, which naturally capture real-world scenarios.
16 PAPERS • NO BENCHMARKS YET
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.
14 PAPERS • 1 BENCHMARK
DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model the local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths, with each QA pair belonging to one of four difficulty levels.
12 PAPERS • 3 BENCHMARKS
Aesthetic Visual Analysis is a dataset for aesthetic image assessment that contains over 250,000 images along with a rich variety of meta-data including a large number of aesthetic scores for each image, semantic labels for over 60 categories as well as labels related to photographic style.
11 PAPERS • 3 BENCHMARKS
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most of the images are sourced from videos, and both the descriptions and the retrievals come from humans.
11 PAPERS • 1 BENCHMARK
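For the retrieval task just described, the snippet below is a minimal, hedged baseline sketch: it scores the 10 candidate images against the description with an off-the-shelf CLIP model from the transformers library and picks the highest-scoring one. The model name, description text, and image filenames are illustrative assumptions, not part of the dataset.

```python
# Hedged baseline sketch for the contrastive-retrieval task described above:
# score 10 candidate images against one description and return the best match.
# Model name and file paths are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

description = "the person waves just before the door closes"   # hypothetical
images = [Image.open(f"candidate_{i}.jpg") for i in range(10)]  # hypothetical files

inputs = processor(text=[description], images=images,
                   return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_text          # shape (1, 10)
predicted = scores.argmax(dim=-1).item()          # index of retrieved image
print(f"retrieved candidate index: {predicted}")
```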
Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots, and metadata. Moviescope is based on the IMDB 5000 dataset, which consists of 5,043 movie records. It is augmented by crawling the video trailer associated with each movie from YouTube and text plots from Wikipedia.
6 PAPERS • NO BENCHMARKS YET
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
We introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diverse questions that examine nine different skills and include both multiple-choice and open-ended formats; and 4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question designs such as movie-spoiler questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing large multimodal models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation shows significant challenges in our benchmark: even the best AI models, such as Gemini, struggle to perform well, reaching only 42.72% average accuracy and an average score of 2.71 out of 5.
5 PAPERS • NO BENCHMARKS YET
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube with over 42 hours of video footage. The dataset consists of two components: segmented videos for activity recognition and continuous videos for activity classification. It is quite challenging as it is created from TV broadcast baseball games where multiple different activities share the camera angle. Further, the motion/appearance difference between the various activities is quite small.
This dataset has the following citation: M. Soliman, M. Kamal, M. Nashed, Y. Mostafa, B. Chawky, D. Khattab, “Violence Recognition from Videos using Deep Learning Techniques”, Proc. 9th International Conference on Intelligent Computing and Information Systems (ICICIS'19), Cairo, pp. 79-84, 2019. Please use it when the dataset is used for research or engineering purposes. When we started our graduation project on violence recognition from videos, we found that there was a shortage of available datasets covering violence between individuals, so we decided to create a new, large dataset with a variety of scenes.
5 PAPERS • 1 BENCHMARK
Comprises 171,191 video segments from 346 high-quality soccer games. The database contains 702,096 bounding boxes, 37,709 essential event labels with time boundaries, and 17,115 highlight annotations for object detection, action recognition, temporal action localization, and highlight detection tasks.
ChronoMagic contains 2,265 metamorphic time-lapse videos, each accompanied by a detailed caption.
4 PAPERS • NO BENCHMARKS YET
Collects dense per-video-shot concept annotations.
4 PAPERS • 1 BENCHMARK
A dataset that represents complex conversational interactions between two individuals via 3D pose. 8 pairwise interactions describing 7 separate conversation-based scenarios were collected using two Kinect depth sensors.
3 PAPERS • NO BENCHMARKS YET
DeepSportradar is a benchmark suite of computer vision tasks, datasets and benchmarks for automated sport understanding. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and the proposed baseline method are provided.
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
A multilingual, multimodal, multi-aspect, expertly annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K short videos (~20 seconds average duration) and 100K unlabeled videos in 11 different languages, and it captures popular short-video trends like pranks, fails, romance, and comedy, expressed via unique audio-visual formats such as self-shot videos, reaction videos, lip-synching, and self-sung songs.
2 PAPERS • NO BENCHMARKS YET
We provide a database containing shot scale annotations (i.e., the apparent distance of the camera from the subject of a filmed scene) for more than 792,000 image frames. Frames belong to 124 full movies from the entire filmographies of 6 important directors: Martin Scorsese, Jean-Luc Godard, Béla Tarr, Federico Fellini, Michelangelo Antonioni, and Ingmar Bergman. Each frame, extracted from the videos at 1 frame per second, is annotated with one of the following shot scale categories: Extreme Close Up (ECU), Close Up (CU), Medium Close Up (MCU), Medium Shot (MS), Medium Long Shot (MLS), Long Shot (LS), Extreme Long Shot (ELS), Foreground Shot (FS), and Insert Shot (IS). Two independent coders annotated all frames from the 124 movies, while a third checked their coding and made decisions in cases of disagreement. The CineScale database enables AI-driven interpretation of shot scale data and opens up a large set of research activities related to the automatic visual analysis of cinematic material.
Contains annotations of human activities with different sub-actions, e.g., the activity Ping-Pong with four sub-actions: pickup-ball, hit, bounce-ball, and serve.
Stanford-ECM is an egocentric multimodal dataset comprising about 27 hours of egocentric video augmented with heart rate and acceleration data. The individual videos range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30 Hz. The phone was equipped with a wide-angle lens, enlarging the horizontal field of view from 45 degrees to about 64 degrees. A wrist-worn heart rate sensor captured the heart rate every 5 seconds. The phone and heart rate monitor were time-synchronized through Bluetooth, and all data was stored in the phone’s storage. Piecewise cubic polynomial interpolation was used to fill in any gaps in the heart rate data. Finally, the data was aligned at the millisecond level at 30 Hz.
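As an illustration of the gap-filling and alignment step mentioned above, here is a minimal sketch that resamples sparse heart-rate readings (one every 5 seconds) onto a 30 Hz timeline using a piecewise cubic interpolant. The exact interpolant used by the dataset authors may differ; PCHIP is one common piecewise-cubic choice, and all timestamps and values below are dummy data, not taken from Stanford-ECM.

```python
# Hedged sketch: piecewise cubic interpolation of sparse heart-rate readings
# onto a 30 Hz timeline. All values below are dummy data for illustration.
import numpy as np
from scipy.interpolate import PchipInterpolator

hr_times = np.arange(0.0, 65.0, 5.0)             # one reading every 5 s
hr_bpm = 70.0 + 5.0 * np.sin(hr_times / 10.0)    # dummy bpm values

target_times = np.arange(0.0, 60.0, 1.0 / 30.0)  # 30 Hz video/accelerometer clock
hr_at_30hz = PchipInterpolator(hr_times, hr_bpm)(target_times)

print(hr_at_30hz.shape)  # 1800 interpolated samples for 60 s of video
```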
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
This large collection of over 161,000 video-label pairs shows humans drawing letters and digits in the air and is used to evaluate a model’s ability to classify articulated motions correctly. Unlike existing video datasets, accurate classification on AirLetters relies on discerning motion patterns and integrating information presented by the video over time (i.e., over many frames). While trivial for humans, accurate representation of complex articulated motions remains an open problem for end-to-end video understanding models.
1 PAPER • NO BENCHMARKS YET
The feature files are named with the YouTube IDs. https://drive.google.com/drive/folders/10-6hkQxMKMGwLXANxfPRE7xw5PKiMjLn?usp=sharing
CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) with a human-in-the-loop pipeline leveraging existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.
1 PAPER • 1 BENCHMARK
This spatio-temporal action dataset for video understanding consists of 4 parts: original videos, cropped videos, video frames, and annotation files. The dataset was built with a proposed multi-person annotation method for spatio-temporal actions: first, ffmpeg is used to crop the videos and extract frames; then YOLOv5 detects the humans in each frame, and DeepSORT assigns an ID to each detected person. Processing the YOLOv5 and DeepSORT detection results yields the annotation files of the spatio-temporal action dataset, completing the construction of the custom dataset.
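The detection-and-tracking part of that pipeline can be approximated with off-the-shelf packages. The sketch below is a hedged illustration using the ultralytics/yolov5 hub model and the deep-sort-realtime package, not the authors' actual annotation code; the clip path, tracker settings, and output format are assumptions.

```python
# Hedged sketch of the described annotation pipeline: detect people with YOLOv5,
# assign persistent IDs with DeepSORT, and collect (frame, ID, box) records.
# Assumes opencv-python, torch (+ ultralytics/yolov5 via torch.hub) and
# deep-sort-realtime are installed; "clip.mp4" is a hypothetical cropped clip.
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")  # person detector
tracker = DeepSort(max_age=30)                              # ID assignment

records = []  # (frame index, person ID, left-top-right-bottom box)
cap = cv2.VideoCapture("clip.mp4")
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    detections = []
    for *xyxy, conf, cls in detector(rgb).xyxy[0].tolist():
        if int(cls) == 0:  # COCO class 0 = person
            x1, y1, x2, y2 = xyxy
            detections.append(([x1, y1, x2 - x1, y2 - y1], conf, "person"))
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            records.append((frame_idx, track.track_id, track.to_ltrb()))
    frame_idx += 1
cap.release()
# `records` is the raw material from which per-person spatio-temporal action
# annotation files like those described above could be assembled.
```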
Kinetics-GEB+ (Generic Event Boundary Captioning, Grounding and Retrieval) is a dataset that consists of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos.
1 PAPER • 3 BENCHMARKS
A benchmark that focuses on the sampling dilemma in long-video tasks. The LSDBench dataset is designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs based on hour-long videos, focusing on dense and short-duration actions with high Necessary Sampling Density (NSD).
Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring users’ preferences. Their inability to interact with users for further refinement or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. The system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video, along with optional information including previously suggested music and the user’s preferences, as input and retrieves an appropriate piece of music matching the context. The reasoning module, equipped with the power of a large language model (Vicuna-7B) and extended to multi-modal inputs, is able to provide a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build
Placepedia contains 240K places with 35M images from all over the world. Each place is associated with its district, city/town/village, state/province, country, continent, and a large number of diverse photos. Both administrative areas and places have rich side information, e.g., description, population, category, and function. In addition, two cleaned subsets (Places-Coarse and Places-Fine) are provided for experiments.
SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset.
Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.