The Multiple Object Tracking 17 (MOT17) dataset is a benchmark dataset for multiple object tracking. Like its predecessor MOT16, the challenge contains seven different indoor and outdoor scenes of public places, with pedestrians as the objects of interest. The video for each scene is divided into two clips, one for training and the other for testing. The dataset provides detections of objects in the video frames from three detectors, namely SDP, Faster R-CNN, and DPM. The challenge accepts both online and offline tracking approaches, where the latter may use future video frames to predict tracks. (A sketch of the detection file format follows this entry.)
291 PAPERS • 2 BENCHMARKS
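The sequence-level detection files in MOT17 (and the other MOTChallenge datasets below) use the published comma-separated MOTChallenge layout. The following is a minimal parsing sketch, assuming that layout; the file path is illustrative and error handling is omitted.

```python
import csv
from collections import defaultdict

def load_mot_detections(path):
    """Group MOT-style detections by frame number.

    Each det.txt line is comma-separated:
    frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z
    (id and x/y/z are -1 for raw detections).
    """
    per_frame = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            frame = int(row[0])
            left, top, width, height = (float(v) for v in row[2:6])
            conf = float(row[6])
            per_frame[frame].append((left, top, width, height, conf))
    return per_frame

# Illustrative usage (path is hypothetical):
# dets = load_mot_detections("MOT17-02-DPM/det/det.txt")
# dets[1] -> [(left, top, width, height, confidence), ...] for frame 1
```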
The MOTChallenge datasets are designed for the task of multiple object tracking. Several variants have been released over the years, including MOT15, MOT17, and MOT20.
192 PAPERS • 8 BENCHMARKS
The MOT16 dataset is a benchmark dataset for multiple object tracking. It is a collection of existing and new data, containing 14 challenging real-world videos of both static and moving scenes, 7 for training and 7 for testing. It is a large-scale dataset, comprising 110,407 bounding boxes in the training set and 182,326 in the test set. All video sequences are annotated under strict standards; the ground truth is highly accurate, making evaluation meaningful.
149 PAPERS • 2 BENCHMARKS
MOT2015 is a dataset for multiple object tracking. It contains 11 different indoor and outdoor scenes of public places with pedestrians as the objects of interest, where camera motion, camera angle, and imaging conditions vary greatly. The dataset provides detections generated by an ACF-based detector.
67 PAPERS • 5 BENCHMARKS
Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate the bounding boxes and identities of objects (e.g., pedestrians and vehicles) in video sequences.
32 PAPERS • 3 BENCHMARKS
The Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of video, with over half a million frame-wise annotations. The dataset is densely annotated: per-frame bounding boxes and person identities are available, as well as camera calibration parameters. The videos are recorded at 15 frames per second (FPS) in five diverse and challenging environments: retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset. (A projection sketch using the calibration parameters follows this entry.)
6 PAPERS • 1 BENCHMARK
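Since MMPTRACK ships camera calibration parameters alongside the annotations, cross-view reasoning typically reduces to standard pinhole projection. Below is a minimal sketch, assuming a pinhole model with intrinsics K and extrinsics (R, t); the numeric values are placeholders, not values from the dataset.

```python
import numpy as np

def project_point(K, R, t, X_world):
    """Project a 3D world point to pixel coordinates with a
    pinhole camera model: x ~ K (R X + t)."""
    X_cam = R @ X_world + t   # world -> camera coordinates
    x = K @ X_cam             # camera -> homogeneous pixel coordinates
    return x[:2] / x[2]       # perspective divide

# Placeholder calibration; real K, R, t come from the dataset's
# calibration parameters.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.5])
print(project_point(K, R, t, np.array([0.5, 0.0, 5.0])))  # ~[1026.7, 540.0]
```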
We collected 32 videos recording bee colony activity at different times of day across several sunny days. The dataset totals 3,562 frames and 43,169 annotations.
2 PAPERS • NO BENCHMARKS YET
CholecTrack20 is a surgical video dataset for tool tracking in laparoscopic cholecystectomy, featuring 20 annotated videos. It provides labels for multi-class, multi-tool tracking under three trajectory perspectives: tool visibility within the camera scope, intracorporeal movement within the patient's body, and the life-long intraoperative trajectory of each tool. Annotations cover spatial coordinates, tool class, operator identity, surgical phase, and visual conditions (occlusion, bleeding, smoke) for seven tool types (grasper, bipolar, hook, scissors, clipper, irrigator, and specimen bag), provided at 1 frame per second across 35K frames and 65K tool instance labels. The official splits allocate 10 videos for training, 2 for validation, and 8 for testing. (A hypothetical record layout follows this entry.)
2 PAPERS • 3 BENCHMARKS
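To make the annotation structure concrete, here is a hypothetical record layout for a single tool instance label, derived only from the fields named above; the actual CholecTrack20 file schema and field names may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layout for one instance label; field names are
# illustrative, not the dataset's actual schema.
@dataclass
class ToolInstance:
    frame: int                  # annotated at 1 frame per second
    bbox: Tuple[float, float, float, float]  # spatial coordinates (x, y, w, h)
    tool_class: str             # e.g. "grasper", "hook", "clipper"
    operator: str               # operator identity
    phase: str                  # surgical phase
    occlusion: bool             # visual conditions
    bleeding: bool
    smoke: bool
    track_visibility: int       # track id within the camera scope
    track_intracorporeal: int   # track id within the patient's body
    track_intraoperative: int   # life-long intraoperative track id
```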
PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos, captured mostly from static-mounted cameras and collected from sources that granted redistribution rights and whose participants gave explicit consent. Each video has ground-truth annotations, including both bounding boxes and tracklet IDs for all persons in each frame (see the tracklet-grouping sketch after this entry).
2 PAPERS • 1 BENCHMARK
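Per-frame boxes paired with tracklet IDs, as in PersonPath22, can be regrouped into per-identity tracklets with a few lines of bookkeeping. A minimal sketch, assuming annotations are available as (frame, tracklet_id, box) tuples; the real annotation file layout may differ.

```python
from collections import defaultdict

def build_tracklets(annotations):
    """Regroup per-frame annotations into per-identity tracklets.

    `annotations` is an iterable of (frame, tracklet_id, box) tuples,
    where box is (x, y, w, h).
    """
    tracklets = defaultdict(list)
    for frame, tid, box in annotations:
        tracklets[tid].append((frame, box))
    for track in tracklets.values():
        track.sort(key=lambda fb: fb[0])  # order each tracklet by frame
    return tracklets
```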
The Oxford Town Center dataset is a 5-minute video with 7,500 annotated frames for pedestrian detection, split into 6,500 frames for training and 1,000 for testing. The data was recorded from a CCTV camera in Oxford for research and development in activity and face recognition.
1 PAPER • 1 BENCHMARK