🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

37 dataset results for Multi-Object Tracking

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vertical, and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence. Ros et al. labeled 170 training images and 46 testing images (from the visual odome

3,216 PAPERS • 141 BENCHMARKS

BDD100K

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in thi

359 PAPERS • 16 BENCHMARKS

MOT17 (Multiple Object Tracking 17)

The Multiple Object Tracking 17 (MOT17) dataset is a dataset for multiple object tracking. Similar to its previous version MOT16, this challenge contains seven different indoor and outdoor scenes of public places with pedestrians as the objects of interest. A video for each scene is divided into two clips, one for training and the other for testing. The dataset provides detections of objects in the video frames with three detectors, namely SDP, Faster-RCNN and DPM. The challenge accepts both on-line and off-line tracking approaches, where the latter are allowed to use the future video frames to predict tracks.

252 PAPERS • 2 BENCHMARKS

MOTChallenge

The MOTChallenge datasets are designed for the task of multiple object tracking. There are several variants of the dataset released each year, such as MOT15, MOT17, MOT20.

175 PAPERS • 8 BENCHMARKS

MOT16 (Multiple Object Tracking 2016)

The MOT16 dataset is a dataset for multiple object tracking. It a collection of existing and new data (part of the sources are from and ), containing 14 challenging real-world videos of both static scenes and moving scenes, 7 for training and 7 for testing. It is a large-scale dataset, composed of totally 110407 bounding boxes in training set and 182326 bounding boxes in test set. All video sequences are annotated under strict standards, their ground-truths are highly accurate, making the evaluation meaningful.

141 PAPERS • 2 BENCHMARKS

Virtual KITTI

Virtual KITTI is a photo-realistic synthetic video dataset designed to learn and evaluate computer vision models for several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.

120 PAPERS • 1 BENCHMARK

UAVDT (Unmanned Aerial Vehicle Benchmark Object Detection and Tracking)

UAVDT is a large scale challenging UAV Detection and Tracking benchmark (i.e., about 80, 000 representative frames from 10 hours raw videos) for 3 important fundamental tasks, i.e., object DETection (DET), Single Object Tracking (SOT) and Multiple Object Tracking (MOT).

81 PAPERS • 1 BENCHMARK

MOT15 (Multiple Object Tracking 15)

MOT2015 is a dataset for multiple object tracking. It contains 11 different indoor and outdoor scenes of public places with pedestrians as the objects of interest, where camera motion, camera angle and imaging condition vary greatly. The dataset provides detections generated by the ACF-based detector.

66 PAPERS • 5 BENCHMARKS

DanceTrack

A large-scale multi-object tracking dataset for human tracking in occlusion, frequent crossover, uniform appearance and diverse body gestures. It is proposed to emphasize the importance of motion analysis in multi-object tracking instead of mainly appearance-matching-based diagram.

62 PAPERS • 1 BENCHMARK

Wildtrack

Wildtrack is a large-scale and high-resolution dataset. It has been captured with seven static cameras in a public open area, and unscripted dense groups of pedestrians standing and walking. Together with the camera frames, we provide an accurate joint (extrinsic and intrinsic) calibration, as well as 7 series of 400 annotated frames for detection at a rate of 2 frames per second. This results in over 40 000 bounding boxes delimiting every person present in the area of interest, for a total of more than 300 individuals.

52 PAPERS • 2 BENCHMARKS

TAO (Tracking Any Object Dataset)

TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. A bottom-up approach was used for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks.

38 PAPERS • 1 BENCHMARK

Virtual KITTI 2

Virtual KITTI 2 is an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15◦). For each sequence we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset’s capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.

33 PAPERS • 1 BENCHMARK

KITTI MOTS (KITTI Multi-Object Tracking and Segmentation (MOTS) Evaluation)

The Multi-Object and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on the KITTI Tracking Evaluation 2012 and extends the annotations to the Multi-Object and Segmentation (MOTS) task. To this end, we added dense pixel-wise segmentation labels for every object. We evaluate submitted results using the metrics HOTA, CLEAR MOT, and MT/PT/ML. We rank methods by HOTA [1]. Our development kit and GitHub evaluation code provide details about the data format as well as utility functions for reading and writing the label files. (adapted for the segmentation case). Evaluation is performed using the code from the TrackEval repository.

26 PAPERS • 1 BENCHMARK

MOT20

MOT20 is a dataset for multiple object tracking. The dataset contains 8 challenging video sequences (4 train, 4 test) in unconstrained environments, from crowded places such as train stations, town squares and a sports stadium. Image Source: https://motchallenge.net/vis/MOT20-04

26 PAPERS • 2 BENCHMARKS

HiEve

HiEve (Human-in-Events)

A new large-scale dataset for understanding human motions, poses, and actions in a variety of realistic events, especially crowd & complex events. It contains a record number of poses (>1M), the largest number of action labels (>56k) for complex events, and one of the largest number of trajectories lasting for long terms (with average trajectory length >480). Besides, an online evaluation server is built for researchers to evaluate their approaches.

18 PAPERS • 1 BENCHMARK

MultiviewX

MultiviewX is a synthetic Multiview pedestrian detection dataset. It is build using pedestrian models from PersonX, in Unity. The MultiviewX dataset covers a square of 16 meters by 25 meters. The ground plane is quantized into a 640x1000 grid. There are 6 cameras with overlapping field-of-view in the MultiviewX dataset, each of which outputs a 1080x1920 resolution image. On average, 4.41 cameras are covering the same location.

17 PAPERS • 2 BENCHMARKS

SportsMOT (SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes)

Motivation Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) bounding boxes and identities in video sequences.

17 PAPERS • 2 BENCHMARKS

SeaDronesSee (SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water)

SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.

16 PAPERS • 3 BENCHMARKS

GMOT-40

GMOT-40 (Generic Multiple Object Tracking (GMOT))

GMOT-40 is the first public dense dataset for Generic Multiple Object Tracking (GMOT). It contains 40 carefully annotated sequences evenly distributed among 10 object categories. Beyond the data, a challenging protocal, one-shot GMOT, is adopted and a series of baseline algorithms is introduced. GMOT-40 is featured in

9 PAPERS • NO BENCHMARKS YET

PathTrack

PathTrack is a dataset for person tracking which contains more than 15,000 person trajectories in 720 sequences.

8 PAPERS • NO BENCHMARKS YET

MMPTRACK (Multi-camera Multiple People Tracking Dataset)

Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of videos, with over half a million frame-wise annotations. The dataset is densely annotated, e.g., per-frame bounding boxes and person identities are available, as well as camera calibration parameters. Our dataset is recorded with 15 frames per second (FPS) in five diverse and challenging environment settings., e.g., retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset.

5 PAPERS • 1 BENCHMARK

Youtube-VIS 2022 Validation

Video object segmentation has been studied extensively in the past decade due to its importance in understanding video spatial-temporal structures as well as its value in industrial applications. Recently, data-driven algorithms (e.g. deep learning) have become the dominant approach to computer vision problems and one of the most important keys to their successes is the availability of large-scale datasets. Previously, we presented the first large-scale video object segmentation dataset named YouTubeVOS and hosted the Large-scale Video Object Segmentation Challenge in conjuction with ECCV 2018, ICCV 2019 and CVPR 2021. This year, we are thrilled to invite you to the 4th Large-scale Video Object Segmentation Challenge in conjunction with CVPR 2022. The benchmark would be an augmented version of the YouTubeVOS dataset with more annotations. Some incorrect annotations are also corrected. For more details, check our website for the workshop and challenge.

5 PAPERS • 1 BENCHMARK

RailEye3D Dataset

The RailEye3D dataset, a collection of train-platform scenarios for applications targeting passenger safety and automation of train dispatching, consists of 10 image sequences captured at 6 railway stations in Austria. Annotations for multi-object tracking are provided in both an unified format as well as the ground-truth format used in the MOTChallenge.

3 PAPERS • NO BENCHMARKS YET

BEE23

BEE23 (Multi-bee Tracking Benchmark)

We collected 32 videos that record bee colony activity from different periods on several sunny days. The total size of the dataset is 3,562 frames and 43,169 annotations.

2 PAPERS • NO BENCHMARKS YET

Synthehicle

Synthehicle is a massive CARLA-based synthehic multi-vehicle multi-camera tracking dataset and includes ground truth for 2D detection and tracking, 3D detection and tracking, depth estimation, and semantic, instance and panoptic segmentation.

2 PAPERS • 1 BENCHMARK

3D-POP

The dataset is designed specifically to solve a range of computer vision problems (2D-3D tracking, posture) faced by biologists while designing behavior studies with animals.

1 PAPER • NO BENCHMARKS YET

3D-ZeF

3D-ZeF (3D ZebraFish Tracking Benchmark)

3D-ZeF dataset consists of eight sequences with a duration between 15-120 seconds and 1-10 free moving zebrafish. The videos have been annotated with a total of 86,400 points and bounding boxes.

1 PAPER • NO BENCHMARKS YET

CholecTrack20 (Multi-Perspective Multi-Class Multi-Tool Tracking Dataset)

CholecTrack20 is a surgical video dataset focusing on laparoscopic cholecystectomy and designed for surgical tool tracking, featuring 20 annotated videos. The dataset includes detailed labels for multi-class multi-tool tracking, offering trajectories for tool visibility within the camera scope, intracorporeal movement within the patient's body, and the life-long intraoperative trajectory of each tool. Annotations cover spatial coordinates, tool class, operator identity, phase, visual conditions (occlusion, bleeding, smoke), and more for tools like grasper, bipolar, hook, scissors, clipper, irrigator, and specimen bag, with annotations provided at 1 frame per second across 35K frames and 65K instance tool labels. The dataset uses official splits, allocating 10 videos for training, 2 for validation, and 8 for testing.

1 PAPER • NO BENCHMARKS YET

DIVOTrack

DIVOTrack is a cross-view multi-object tracking dataset for DIVerse Open scenes with dense tracking pedestrians in realistic and non-experimental environments. DIVOTrack has ten distinct scenarios and 550 cross-view tracks.

1 PAPER • NO BENCHMARKS YET

FISHTRAC

A dataset of real-world underwater videos annotated with multi-object tracking labels. The data was collected of the coast of the Big Island of Hawaii and the primary goal is to help scientists studying fish behavior, with the goal of conserving rare and beautiful fish species.

1 PAPER • NO BENCHMARKS YET

HA-ViD (HA-ViD: A Human Assembly Video Dataset)

Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD – an assembly video dataset that features representative industrial assembly scenarios, natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulate action annotations to subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps for comprehending knowledge in assembly progress, process effici

1 PAPER • NO BENCHMARKS YET

LTFT (Long-Term Face Tracking)

Dataset originally conceived for multi-face tracking/detection for highly crowded scenarios. In these scenarios, the face is the only part that can be used to track the individuals.

1 PAPER • NO BENCHMARKS YET

PersonPath22

PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.

1 PAPER • 1 BENCHMARK

SoccerNet-GSR (SoccerNet Game State Reconstruction)

The SoccerNet Game State Reconstruction task is a novel high level computer vision task that is specific to sports analytics. It aims at recognizing the state of a sport game, i.e., identifying and localizing all sports individuals (players, referees, ..) on the field based on a raw input videos. SoccerNet-GSR is composed of 200 video sequences of 30 seconds, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch with their respective role, team, and jersey number.

1 PAPER • 1 BENCHMARK

SoccerTrack Dataset

The SoccerTrack dataset comprises top-view and wide-view video footage annotated with bounding boxes. GNSS coordinates of each player are also provided. We hope that the SoccerTrack dataset will help advance the state of the art in multi-object tracking, especially in team sports.

1 PAPER • NO BENCHMARKS YET

TAO-Amodal

Our dataset augments the TAO dataset with amodal bounding box annotations for fully invisible, out-of-frame, and occluded objects. Note that this implies TAO-Amodal also includes modal segmentation masks (as visualized in the color overlays above). Our dataset encompasses 880 categories, aimed at assessing the occlusion reasoning capabilities of current trackers through the paradigm of Tracking Any Object with Amodal perception (TAO-Amodal).

1 PAPER • NO BENCHMARKS YET

GroOT

GroOT (Grounded Multiple Object Tracking)

One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for that Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions of 256K words describing their appearance and action in detail. To cover a diverse range of scenes, GroOT was created using official videos and bounding box annotations from the MOT17, TAO and MOT20.

0 PAPER • NO BENCHMARKS YET