11 dataset results for Instance Segmentation AND Videos

TYC Dataset (The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures)

We introduce the trapped yeast cell (TYC) dataset, a novel dataset for understanding instance-level semantics and motions of cells in microstructures. We release 105 densely annotated high-resolution brightfield microscopy images, including about 19k instance masks. We also release 261 curated video clips composed of 1293 high-resolution microscopy images to facilitate unsupervised understanding of cell motions and morphology.

1 PAPER • NO BENCHMARKS YET

Plittersdorf

A set of 221 stereo videos captured by the SOCRATES stereo camera trap in a wildlife park in Bonn, Germany between February and July of 2022. A subset of frames is labeled with instance annotations in the COCO format.
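As a rough illustration of what "instance annotations in the COCO format" means in practice, the sketch below parses a small COCO-style JSON document and groups the annotations by image. The top-level keys (`images`, `annotations`, `categories`) and the per-annotation fields follow the general COCO convention; the file name, category name, and coordinates are made up for illustration and are not taken from the Plittersdorf data.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written COCO-style content (not real Plittersdorf data).
COCO_JSON = """
{
  "images": [{"id": 1, "file_name": "frame_0001.png", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "animal"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1,
     "bbox": [100, 120, 50, 80], "area": 4000, "iscrowd": 0,
     "segmentation": [[100, 120, 150, 120, 150, 200, 100, 200]]}
  ]
}
"""


def group_annotations_by_image(coco):
    """Map each image id to the list of its instance annotations."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return dict(grouped)


coco = json.loads(COCO_JSON)
by_image = group_annotations_by_image(coco)
category_names = {c["id"]: c["name"] for c in coco["categories"]}

for image in coco["images"]:
    for ann in by_image.get(image["id"], []):
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        print(image["file_name"], category_names[ann["category_id"]], (x, y, w, h))
```

Because only a subset of frames is labeled, a loader along these lines should expect images without any annotations, which is why the lookup uses `by_image.get(..., [])`.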

2 PAPERS • NO BENCHMARKS YET

HT1080WT cells - 3D collagen type I matrices (HT1080WT cells embedded in 3D collagen type I matrices - manual annotations for cell instance segmentation and tracking)

Human fibrosarcoma HT1080WT (ATCC) cells at low cell densities embedded in 3D collagen type I matrices [1]. The time-lapse videos were recorded every 2 minutes for 16.7 hours and covered a field of view of 1002 pixels × 1004 pixels with a pixel size of 0.802 μm/pixel. The videos were pre-processed to correct frame-to-frame drift artifacts, resulting in a final size of 983 pixels × 985 pixels.
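The acquisition figures above can be turned into approximate frame counts and physical dimensions. The following is a back-of-the-envelope sketch only: the description does not state the exact number of frames per video, and the function names are illustrative.

```python
# Values quoted in the dataset description.
INTERVAL_MIN = 2.0             # minutes between consecutive frames
DURATION_H = 16.7              # total recording time in hours
PIXEL_SIZE_UM = 0.802          # micrometres per pixel
RAW_SIZE_PX = (1002, 1004)     # field of view before drift correction
CORRECTED_SIZE_PX = (983, 985) # final size after drift correction


def frame_intervals(duration_h, interval_min):
    """Number of sampling intervals that fit into the recording time."""
    return round(duration_h * 60.0 / interval_min)


def physical_size_um(size_px, pixel_size_um):
    """Convert a size in pixels to micrometres, rounded to 3 decimals."""
    return tuple(round(n * pixel_size_um, 3) for n in size_px)


intervals = frame_intervals(DURATION_H, INTERVAL_MIN)
raw_um = physical_size_um(RAW_SIZE_PX, PIXEL_SIZE_UM)
corrected_um = physical_size_um(CORRECTED_SIZE_PX, PIXEL_SIZE_UM)
print(intervals, raw_um, corrected_um)
```

Under these numbers each video spans about 501 sampling intervals, the raw field of view is roughly 804 μm × 805 μm, and the drift-corrected frames cover roughly 788 μm × 790 μm.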

1 PAPER • NO BENCHMARKS YET

SB20 (Sugar Beet 2020 University of Bonn)

Video sequences recorded at a field on Campus Kleinaltendorf (CKA), University of Bonn, by BonBot-I, an autonomous weeding robot. The data was captured with an Intel RealSense D435i sensor mounted with a nadir view of the ground.

2 PAPERS • NO BENCHMARKS YET

TikTok Dataset (Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos)

We learn high-fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application. TikTok is one of the most popular video-sharing applications across generations and hosts short videos (10-15 seconds) of diverse dance challenges. We manually selected more than 300 dance videos, each capturing a single person performing dance moves, from TikTok dance challenge compilations for each month, variety, and type of dance, favoring moderate movements that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segmented these images using the Removebg application and computed the UV coordinates with DensePose.
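The reported image count can be sanity-checked from the other figures in the description. This is a rough estimate only, assuming exactly 300 videos (the text says "more than 300") and clip lengths at the two ends of the stated 10-15 second range.

```python
FPS = 30                 # frames extracted per second, as stated
MIN_VIDEOS = 300         # lower bound: the description says "more than 300"
CLIP_SECONDS = (10, 15)  # stated range of clip lengths in seconds


def frame_count(videos, seconds, fps):
    """Total number of extracted frames for equally long clips."""
    return videos * seconds * fps


low = frame_count(MIN_VIDEOS, CLIP_SECONDS[0], FPS)
high = frame_count(MIN_VIDEOS, CLIP_SECONDS[1], FPS)
print(low, high)  # 90000 135000
```

A range of 90,000 to 135,000 frames is consistent with the stated total of more than 100K images.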

3 PAPERS • NO BENCHMARKS YET

Lindenthal Camera Traps

This data set contains 775 video sequences, captured in the wildlife park Lindenthal (Cologne, Germany) as part of the AMMOD project, using an Intel RealSense D435 stereo camera. In addition to color and infrared images, the D435 is able to infer the distance (or “depth”) to objects in the scene using stereo vision. Observed animals include various birds (at daytime) and mammals such as deer, goats, sheep, donkeys, and foxes (primarily at nighttime). A subset of 412 images is annotated with a total of 1038 individual animal annotations, including instance masks, bounding boxes, class labels, and corresponding track IDs to identify the same individual over the entire video.
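With 1038 annotations over 412 images, a labeled image carries about 2.5 animals on average. The track IDs are what link these per-frame annotations into one trajectory per individual. The sketch below shows that grouping on made-up records; the field names (`frame`, `track_id`, `label`, `bbox`) and all values are assumptions for illustration and do not reflect the dataset's actual file layout.

```python
from collections import defaultdict

# Hypothetical per-frame annotation records (not real Lindenthal data).
annotations = [
    {"frame": 1, "track_id": 9, "label": "fox", "bbox": [200, 90, 30, 25]},
    {"frame": 0, "track_id": 7, "label": "deer", "bbox": [10, 20, 40, 60]},
    {"frame": 1, "track_id": 7, "label": "deer", "bbox": [12, 21, 40, 60]},
]


def build_tracks(records):
    """Group records by track id, ordered by frame index within each track."""
    tracks = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["frame"]):
        tracks[rec["track_id"]].append(rec)
    return dict(tracks)


tracks = build_tracks(annotations)
for track_id, recs in tracks.items():
    frames = [r["frame"] for r in recs]
    print(track_id, recs[0]["label"], frames)
```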

1 PAPER • NO BENCHMARKS YET

BUP20 (Sweet Pepper 2020 University of Bonn)

Video sequences from a glasshouse environment at Campus Kleinaltendorf (CKA), University of Bonn, captured by PATHoBot, a glasshouse monitoring robot.

4 PAPERS • NO BENCHMARKS YET

BDD100K

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities. We construct BDD100K, the largest driving video dataset with 100K videos and 10 tasks to evaluate the exciting progress of image recognition algorithms on autonomous driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. Based on this diverse dataset, we build a benchmark for heterogeneous multitask learning and study how to solve the tasks together. Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks. BDD100K opens the door for future studies in thi

366 PAPERS • 16 BENCHMARKS

YouTube-VIS 2019

YouTube-VIS is a new dataset tailored for tasks such as simultaneous detection, segmentation, and tracking of object instances in videos. It is built on YouTube-VOS, the current largest video object segmentation dataset.

147 PAPERS • 2 BENCHMARKS

MS COCO (Microsoft Common Objects in Context)

The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

10,315 PAPERS • 93 BENCHMARKS

UVO (Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation)

UVO is a new benchmark for open-world class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world setup, UVO is significantly larger, providing approximately 8 times more videos compared with DAVIS, and 7 times more mask (instance) annotations per video compared with YouTube-VOS and YouTube-VIS. UVO is also more challenging as it includes many videos with crowded scenes and complex background motions.

23 PAPERS • 3 BENCHMARKS