Object Tracking Benchmark (OTB) is a visual tracking benchmark that is widely used to evaluate the performance of visual tracking algorithms. The dataset contains a total of 100 sequences, each annotated frame-by-frame with bounding boxes and 11 challenge attributes. The OTB-2013 dataset contains 51 sequences, while the OTB-2015 dataset contains all 100 sequences of the OTB dataset.
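OTB ranks trackers with success plots: the fraction of frames whose predicted box overlaps the ground truth above a threshold, swept over thresholds; the area under that curve (AUC) is the usual summary score. Below is a minimal sketch of the per-frame IoU and success-curve computation (function names are illustrative, not from the official toolkit):

```python
import numpy as np

def iou(pred, gt):
    """Overlap (IoU) between two boxes given as (x, y, w, h)."""
    x1 = max(pred[0], gt[0])
    y1 = max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(preds, gts, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each overlap threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(preds, gts)])
    return [(overlaps > t).mean() for t in thresholds]
```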
354 PAPERS • 4 BENCHMARKS
OTB-2015, also referred to as the Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for evaluating visual tracking. Image Source: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
168 PAPERS • 1 BENCHMARK
LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmarks. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges arising from the wild, where target objects may disappear and reappear in the view.
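Each LaSOT sequence ships its per-frame boxes as plain text. A minimal loading sketch, assuming the common layout of one comma-separated x,y,w,h line per frame in a groundtruth.txt file (the example path is illustrative):

```python
import numpy as np

def load_groundtruth(path):
    """Read one (x, y, w, h) box per frame from a comma-separated file."""
    return np.loadtxt(path, delimiter=",")

# Hypothetical sequence path; adjust to wherever LaSOT is unpacked.
boxes = load_groundtruth("LaSOT/airplane/airplane-1/groundtruth.txt")
print(boxes.shape)  # (num_frames, 4)
```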
165 PAPERS • 1 BENCHMARK
TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos, split into 30,132 training videos and 511 testing videos, with an average of 470.9 frames per video.
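TrackingNet reports precision and a normalized precision in which the center error is rescaled by the ground-truth box size, so the metric does not depend on image resolution. A hedged sketch of that normalization (the official toolkit may differ in details):

```python
import numpy as np

def normalized_center_error(pred, gt):
    """Center distance between (x, y, w, h) boxes, rescaled by the
    ground-truth width and height so it is resolution-independent."""
    pred_c = np.array([pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0])
    gt_c = np.array([gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0])
    return np.linalg.norm((pred_c - gt_c) / np.array([gt[2], gt[3]]))
```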
131 PAPERS • 1 BENCHMARK
OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.
110 PAPERS • 2 BENCHMARKS
OxUvA is a dataset and benchmark for evaluating long-term single-object tracking algorithms.
27 PAPERS • NO BENCHMARKS YET
CDTB (color-and-depth visual object tracking) is a dataset recorded by several passive and active RGB-D setups, containing indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence, to enable tracker evaluation under realistic conditions. Sequences are annotated per frame with 13 visual attributes for detailed analysis. The dataset contains around 100,000 samples. Image Source: https://www.vicos.si/Projects/CDTB
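An RGB-D tracker consumes each color frame together with its aligned depth map. A minimal sketch of iterating such pairs, assuming color/ and depth/ subdirectories with matching file names (this layout is an assumption, not the documented CDTB structure):

```python
import os
import cv2  # OpenCV, for image I/O

def paired_frames(seq_dir):
    """Yield (color, depth) frame pairs from one sequence directory."""
    color_dir = os.path.join(seq_dir, "color")
    depth_dir = os.path.join(seq_dir, "depth")
    for name in sorted(os.listdir(color_dir)):
        color = cv2.imread(os.path.join(color_dir, name))
        # Depth maps are often 16-bit; read them without conversion.
        depth = cv2.imread(os.path.join(depth_dir, name), cv2.IMREAD_UNCHANGED)
        yield color, depth
```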
13 PAPERS • NO BENCHMARKS YET
A new long video dataset and benchmark for single object tracking. The dataset consists of 50 HD videos from real-world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20 times larger in average duration per sequence and more than 8 times larger in total covered duration than existing generic datasets for visual tracking.
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool by clustering visual features of object and background, so that the 25 sequences evenly sample the existing pool.
11 PAPERS • NO BENCHMARKS YET
Tracking by Natural Language (TNL2K) is a benchmark constructed for the evaluation of tracking by natural language specification; it comprises 2,000 video sequences, each annotated with a natural language description of the target.
10 PAPERS • 2 BENCHMARKS
YouTube-BoundingBoxes (YT-BB) is a large-scale dataset of video URLs with densely-sampled object bounding box annotations. The dataset consists of approximately 380,000 video segments, each about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second.
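The YT-BB annotations are distributed as CSV, one row per annotated frame, with box coordinates normalized to [0, 1]. A parsing sketch, assuming the published column order (verify against the release you download):

```python
import csv

# Assumed YT-BB column order; check it against the actual release.
FIELDS = ["youtube_id", "timestamp_ms", "class_id", "class_name",
          "object_id", "object_presence", "xmin", "xmax", "ymin", "ymax"]

def read_annotations(path):
    """Yield one dict per annotated frame; box coordinates are in [0, 1]."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            rec = dict(zip(FIELDS, row))
            for key in ("xmin", "xmax", "ymin", "ymax"):
                rec[key] = float(rec[key])
            yield rec
```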
5 PAPERS • NO BENCHMARKS YET
The Multi-camera Multiple People Tracking (MMPTRACK) dataset has about 9.6 hours of video, with over half a million frame-wise annotations. The dataset is densely annotated: per-frame bounding boxes and person identities are available, as well as camera calibration parameters. The dataset is recorded at 15 frames per second (FPS) in five diverse and challenging environment settings: retail, lobby, industry, cafe, and office. This is by far the largest publicly available multi-camera multiple people tracking dataset.
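Because calibration parameters are provided per camera, tracks can be associated across views by projecting world coordinates into each camera. A minimal pinhole-projection sketch (the K, R, t names are the generic camera model, not MMPTRACK's file format):

```python
import numpy as np

def project(point_3d, K, R, t):
    """Project a 3D world point to pixel coordinates (pinhole model).

    K: 3x3 intrinsics; R: 3x3 rotation; t: translation 3-vector.
    """
    cam = R @ np.asarray(point_3d) + t  # world -> camera coordinates
    uvw = K @ cam                       # camera -> homogeneous pixels
    return uvw[:2] / uvw[2]             # perspective divide
```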
2 PAPERS • NO BENCHMARKS YET
A dataset originally conceived for multi-face tracking/detection in highly crowded scenarios. In these scenarios, the face is the only body part that can be used to track individuals.
1 PAPER • NO BENCHMARKS YET
MobiFace is the first dataset for single face tracking in mobile situations. It consists of 80 unedited live-streaming mobile videos captured by 70 different smartphone users in fully unconstrained environments. Over 95K bounding boxes are manually labelled. The videos are carefully selected to cover typical smartphone usage. The videos are also annotated with 14 attributes, including 6 newly proposed attributes and 8 commonly seen in object tracking.
This dataset contains nine video sequences captured by a webcam for salient closed boundary tracking evaluation. Each sequence is about 30 sec long (30 fps) and the frame size is 640×480 (width×height). There are 9,598 frames in total. In each sequence, different motion styles, such as translation, rotation, and viewpoint change, are performed.
The dataset is composed of 100 video sequences densely annotated with 60K bounding boxes, 17 sequence attributes, 13 action verb attributes and 29 target object attributes.