Object Tracking Benchmark (OTB) is a visual tracking benchmark that is widely used to evaluate the performance of a visual tracking algorithm. The dataset contains a total of 100 sequences and each is annotated frame-by-frame with bounding boxes and 11 challenge attributes. OTB-2013 dataset contains 51 sequences and the OTB-2015 dataset contains all 100 sequences of the OTB dataset.
381 PAPERS • 4 BENCHMARKS
LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmark. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild where target objects may disappear and re-appear again in the view.
209 PAPERS • 2 BENCHMARKS
The GOT-10k dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labelled bounding boxes. The dataset contains more than 560 classes of real-world moving objects and 80+ classes of motion patterns.
184 PAPERS • 2 BENCHMARKS
OTB-2015, also referred as Visual Tracker Benchmark, is a visual tracking dataset. It contains 100 commonly used video sequences for evaluating visual tracking. Image Source: http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
171 PAPERS • 1 BENCHMARK
TrackingNet is a large-scale tracking dataset consisting of videos in the wild. It has a total of 30,643 videos split into 30,132 training videos and 511 testing videos, with an average of 470,9 frames.
167 PAPERS • 2 BENCHMARKS
Youtube-VOS is a Video Object Segmentation dataset that contains 4,453 videos - 3,471 for training, 474 for validation, and 508 for testing. The training and validation videos have pixel-level ground truth annotations for every 5th frame (6 fps). It also contains Instance Segmentation annotations. It has more than 7,800 unique objects, 190k high-quality manual annotations and more than 340 minutes in duration.
155 PAPERS • 10 BENCHMARKS
VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets.
121 PAPERS • 1 BENCHMARK
VOT2016 is a video dataset for visual object tracking. It contains 60 video clips and 21,646 corresponding ground truth maps with pixel-wise annotation of salient objects.
110 PAPERS • 1 BENCHMARK
OTB2013 is the previous version of the current OTB2015 Visual Tracker Benchmark. It contains only 50 tracking sequences, as opposed to the 100 sequences in the current version of the benchmark.
109 PAPERS • 2 BENCHMARKS
VOT2017 is a Visual Object Tracking dataset for different tasks that contains 60 short sequences annotated with 6 different attributes.
53 PAPERS • 2 BENCHMARKS
Tracking by Natural Language (TNL2K) is constructed for the evaluation of tracking by natural language specification. TNL2K features:
29 PAPERS • 2 BENCHMARKS
Source: https://www.vicos.si/Projects/CDTB 4.2 State-of-the-art Comparison A TH CTB (color-and-depth visual object tracking) dataset is recorded by several passive and active RGB-D setups and contains indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence to enable tracker evaluation under realistic conditions. Sequences are per-frame annotated with 13 visual attributes for detailed analysis. It contains around 100,000 samples. Image Source: https://www.vicos.si/Projects/CDTB
16 PAPERS • NO BENCHMARKS YET
A new long video dataset and benchmark for single object tracking. The dataset consists of 50 HD videos from real world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20 folds larger in average duration per sequence and more than 8 folds larger in terms of total covered duration, as compared to existing generic datasets for visual tracking.
14 PAPERS • NO BENCHMARKS YET
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool of sequences using a methodology based on clustering visual features of object and background so that those 25 sequences sample evenly well the existing pool.
12 PAPERS • 1 BENCHMARK
TREK-150 is a benchmark dataset for object tracking in First Person Vision (FPV) videos composed of 150 densely annotated video sequences.
6 PAPERS • NO BENCHMARKS YET
VOT2020 is a Visual Object Tracking benchmark for short-term tracking in RGB.
5 PAPERS • 1 BENCHMARK
VOT2019 is a Visual Object Tracking benchmark for short-term tracking in RGB.
4 PAPERS • 1 BENCHMARK
The AU-AIR is a multi-modal aerial dataset captured by a UAV. Having visual data, object annotations, and flight data (time, GPS, altitude, IMU sensor data, velocities), AU-AIR meets vision and robotics for UAVs.
2 PAPERS • NO BENCHMARKS YET
Informative Tracking Benchmark (ITB) is a small and informative tracking benchmark with 7% out of 1.2 M frames of existing and newly collected datasets, which enables efficient evaluation while ensuring effectiveness. Specifically, the authors designed a quality assessment mechanism to select the most informative sequences from existing benchmarks taking into account 1) challenging level, 2) discriminative strength, 3) and density of appearance variations. Furthermore, they collect additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario.
2 PAPERS • 1 BENCHMARK
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model.
2 PAPERS • 1 BENCHMARK