2 PAPERS • NO BENCHMARKS YET
…If you use this dataset in your work, please consider citing: @inproceedings{ulucan2020large, title={A Large-Scale Dataset for Fish Segmentation and Classification}, author={Ulucan, Oguzhan and Karakaya … This dataset was collected in order to carry out segmentation, feature extraction, and classification tasks and to compare common segmentation, feature extraction, and classification algorithms (Semantic Segmentation, Convolutional Neural Networks, Bag of Features).
1 PAPER • NO BENCHMARKS YET
The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote design of flexible methods for dealing with various guidelines, and help
4 PAPERS • NO BENCHMARKS YET
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the
3 PAPERS • NO BENCHMARKS YET
This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.
…Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to the turkers, who are asked to select a video segment … After this segment selection step, another group of workers is asked to write descriptions for each displayed segment. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words.
9 PAPERS • NO BENCHMARKS YET
…text segmentation and text segment classification) tasks and comprises 169 documents and gold standard annotations for page segments … Partition (P2) contains 75 documents with a significantly richer
gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.
20 PAPERS • 2 BENCHMARKS
The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). … produced by a state-of-the-art RST parser … This dataset is intended to aid sentiment analysis research and, in particular, the evaluation of methods that attempt to predict sentiment on a fine-grained, segment-level
BiasCorp is a dataset for racism detection containing 139,090 comments and news segments from three specific sources: Fox News, BreitbartNews and YouTube.
…In FUNSD and CORD, segment layout annotations are aligned with labeled entities, so they do not reflect the reading order issue of NER on scanned VrDs and are thus unsuitable for evaluating current … Their segment layout annotations are aligned with real-world situations and entity mentions are labeled on words. The proposed CORD-r consists of 999 document samples including the image, layout annotation of segments and words, and labeled entities of 30 categories.
3 PAPERS • 1 BENCHMARK
…In FUNSD and CORD, segment layout annotations are aligned with labeled entities, so they do not reflect the reading order issue of NER on scanned VrDs and are thus unsuitable for evaluating current … Their segment layout annotations are aligned with real-world situations and entity mentions are labeled on words. The proposed FUNSD-r consists of 199 document samples including the image, layout annotation of segments and words, and labeled entities of 3 categories.
scb-mt-en-th-2020 is an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled
…To create this dataset, a large number of text segments extracted from English language Wikipedia are perturbed and then verified with crowd-sourced annotations.
5 PAPERS • NO BENCHMARKS YET
…The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.
CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.
13 PAPERS • 1 BENCHMARK
Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension
…Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised and other)
6 PAPERS • 1 BENCHMARK
…The dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different
The University of Massachusetts Amherst citation field extraction dataset contains labels and segments for extracted citations from articles found on arXiv. Each citation string was labeled hierarchically, separating coarse-grain and fine-grain labeled segments. Dataset introduced in the following paper: Sam Anzaroot and Andrew McCallum.
0 PAPERS • NO BENCHMARKS YET
…Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. The FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and
…Basically, "rationales" are segments of the text that support an annotator's classification. … Then the rationales would be segments of the text that support the claim (by an annotator) that the review is, indeed, positive. Here are some examples of positive rationales (the segments enclosed by double square brackets): [[you will enjoy the hell out of]] American Pie. fortunately, they [[managed to do it in an interesting
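The double-square-bracket markup in the examples above can be pulled apart with a small regular expression. The following is a minimal sketch, assuming rationales are always marked inline as `[[ ... ]]` and never nested; the function name `extract_rationales` is hypothetical and is not part of any released dataset tooling.

```python
import re

# Assumption: a rationale is any span wrapped in [[ ... ]], as in the
# examples above. Unclosed brackets (e.g. from truncated text) are ignored.
RATIONALE = re.compile(r"\[\[(.+?)\]\]")


def extract_rationales(text):
    """Return (rationales, plain_text) for one annotated review string."""
    # Collect the text inside each closed [[ ... ]] pair.
    rationales = [m.group(1).strip() for m in RATIONALE.finditer(text)]
    # Remove only the bracket markers, keeping the rationale words in place.
    plain_text = RATIONALE.sub(lambda m: m.group(1), text)
    return rationales, plain_text


sample = "[[you will enjoy the hell out of]] American Pie."
spans, plain = extract_rationales(sample)
print(spans)
print(plain)
```

A span whose closing brackets are missing, as in the truncated second example, is left untouched rather than guessed at.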
…pairs primarily aligned with English (39 out of 41) and mined using the parallel-data-crawling tool Bitextor which includes downloading documents, preprocessing and normalization, aligning documents and segments
57 PAPERS • NO BENCHMARKS YET
…It contains over 10M segments of multilingual open data. The data has been collected from sites allowing free use and reuse of its content, as well as from Public Sector web sites.
…To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. For the multiple object video segmentation task we consider DAVIS17. As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 and DAVIS17 with non-ambiguous referring expressions. (We actually quantified that only ∼15% of the collected descriptions become invalid over time, and this does not strongly affect segmentation results, as the temporal consistency step helps to disambiguate some … We believe the collected data will be of interest to segmentation as well as vision and language communities, providing an opportunity to explore language as alternative input for video object segmentation
75 PAPERS • 5 BENCHMARKS
…It includes tasks like: word segmentation, part of speech tagging, reading comprehension and document retrieval.
…Each image has a high-quality segmentation mask, a sketch, descriptive text, and a version of the image with a transparent background.
27 PAPERS • 3 BENCHMARKS
…It contains human annotated ground truth labels for both bounding boxes and instance-wise segmentation masks.
…Data are segmented into sentences which are further word tokenized.
2 PAPERS • 12 BENCHMARKS
…In addition to the samples themselves, some entries in the dataset are accompanied by supplementary natural language descriptions, size measurements, and segmentation masks.
…This dataset consists of 4,000 English segments (4,500 tokens) that have been translated into each of 26 low-resource languages, as well as three higher-resource pivot languages (es, fr, hi).
…Annotations include: multiple POS tags, morphological features and lemmatization; sentence segmentation and rough speech act; document structure in TEI XML (paragraphs, headings, figures, etc.)
8 PAPERS • 1 BENCHMARK
The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption.
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions.
1 PAPER • 2 BENCHMARKS
…The dataset is segmented into 3 different subsets, each corresponding to a severity level of the LaTeXML software responsible for the HTML5 conversion.
…The dataset consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
…The dataset is segmented into 7.5K training problems and 1K test problems.
659 PAPERS • 1 BENCHMARK
…categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. 3) Benchmark tests performed on the proposed dataset to assess varied tasks and methods, including video temporal segmentation
…The dataset contains 6,892 segment-level summarization instances for training and performance evaluation.
7 PAPERS • NO BENCHMARKS YET
…In this game, the first player views an image with a segmented target object and writes a natural language expression referring to that object. These datasets serve as valuable resources for tasks like referring expression segmentation, comprehension, and visual grounding in computer vision research.
302 PAPERS • 19 BENCHMARKS
…The goal of this work is to segment the sections of clinical medical domain documentation.
2 PAPERS • 2 BENCHMARKS
…A subset of 1.9M includes diverse annotation types: 15,851,536 boxes on 600 classes; 2,785,498 instance segmentations on 350 classes; 3,284,280 relationship annotations on 1,466 relationships; 675,155
…Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs constitute a total of about 6 hours of music.
…It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community
The dataset comes with a shot segmentation (around 1 million shots) for which we analyze content specifics and statistics.
…Each worker is assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors).
22 PAPERS • 2 BENCHMARKS
…The dataset also provides image segmentation masks, which label persuasion strategies in the corresponding ad images on the test split.
…The screenplay (all dialogue and description parts of the movie) segmented into scenes (selected from the Scriptbase dataset). Gold scene-level TP labels for the screenplays of the test set.
11 PAPERS • NO BENCHMARKS YET