66 dataset results for segmentation AND Texts

COST (COCO Segmentation Text)

2 PAPERS • NO BENCHMARKS YET

A Large Scale Fish Dataset (A Large-Scale Dataset for Fish Segmentation and Classification)

…If you use this dataset in your work, please consider citing: @inproceedings{ulucan2020large, title={A Large-Scale Dataset for Fish Segmentation and Classification}, author={Ulucan, Oguzhan and Karakaya … This dataset was collected to carry out segmentation, feature extraction, and classification tasks and to compare the common segmentation, feature extraction, and classification algorithms (Semantic Segmentation, Convolutional Neural Networks, Bag of Features).

1 PAPER • NO BENCHMARKS YET

DISRPT2019 (DISRPT2019 shared task on Discourse Unit Segmentation and Connective Detection)

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote design of flexible methods for dealing with various guidelines, and help

4 PAPERS • NO BENCHMARKS YET

DISRPT2021 (DISRPT2021 shared task on Discourse Unit Segmentation, Connective Detection and Discourse Relation Classification)

The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the

3 PAPERS • NO BENCHMARKS YET

Famulus

This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.

3 PAPERS • NO BENCHMARKS YET

How2R

…Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to the turkers, who are asked to select a video segment … After this segment selection step, another group of workers is asked to write descriptions for each displayed segment. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words.

9 PAPERS • NO BENCHMARKS YET

CareerCoach 2022

…text segmentation and text segment classification) tasks and comprises 169 documents and gold standard annotations for page segments … Partition (P2) contains 75 documents with a significantly richer

1 PAPER • NO BENCHMARKS YET

gRefCOCO

gRefCOCO is the first large-scale Generalized Referring Expression Segmentation dataset that contains multi-target, no-target, and single-target expressions.

20 PAPERS • 2 BENCHMARKS

SPOT (Sentiment Polarity Annotations Dataset)

The SPOT dataset contains 197 reviews originating from the Yelp'13 and IMDB collections (1), annotated with segment-level polarity labels (positive/neutral/negative). … produced by a state-of-the-art RST parser … This dataset is intended to aid sentiment analysis research and, in particular, the evaluation of methods that attempt to predict sentiment on a fine-grained, segment-level

3 PAPERS • NO BENCHMARKS YET

BiasCorp

BiasCorp is a dataset for racism detection containing 139,090 comments and news segments from three sources: Fox News, BreitbartNews and YouTube.

2 PAPERS • NO BENCHMARKS YET

CORD-r

…In FUNSD and CORD, segment layout annotations are aligned with labeled entities, so they do not reflect the reading order issue of NER on scanned VrDs and are thus unsuitable for evaluating current … Their segment layout annotations are aligned with real-world situations and entity mentions are labeled on words. The proposed CORD-r consists of 999 document samples, including the image, layout annotation of segments and words, and labeled entities of 30 categories.

3 PAPERS • 1 BENCHMARK

FUNSD-r

…In FUNSD and CORD, segment layout annotations are aligned with labeled entities, so they do not reflect the reading order issue of NER on scanned VrDs and are thus unsuitable for evaluating current … Their segment layout annotations are aligned with real-world situations and entity mentions are labeled on words. The proposed FUNSD-r consists of 199 document samples, including the image, layout annotation of segments and words, and labeled entities of 3 categories.

3 PAPERS • 1 BENCHMARK

scb-mt-en-th-2020

scb-mt-en-th-2020 is an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled

1 PAPER • NO BENCHMARKS YET

HaDes

…To create this dataset, a large number of text segments extracted from English-language Wikipedia are perturbed and then verified with crowd-sourced annotations.

5 PAPERS • NO BENCHMARKS YET

MLQE (MultiLingual Quality Estimation)

…The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.

5 PAPERS • NO BENCHMARKS YET

CH-SIMS

CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.

13 PAPERS • 1 BENCHMARK

CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with word embeddings of dimension

1 PAPER • NO BENCHMARKS YET

LSSED

…Each segment is annotated for the presence of 11 emotions (angry, neutral, fear, happy, sad, disappointed, bored, disgusted, excited, surprised, and other)

6 PAPERS • 1 BENCHMARK

PropSegmEnt

…The dataset structure resembles the tasks of (1) segmenting sentences within a document into the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different

1 PAPER • NO BENCHMARKS YET

UMass Citation Field Extraction

The University of Massachusetts Amherst citation field extraction dataset contains labels and segments for extracted citations from articles found on arXiv. Each citation string was labeled hierarchically, separating coarse-grain and fine-grain labeled segments. Dataset introduced in the following paper: Sam Anzaroot and Andrew McCallum.

0 PAPERS • NO BENCHMARKS YET

FewSOL (A Dataset for Few-Shot Object Learning in Robotic Environments)

…Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and

4 PAPERS • NO BENCHMARKS YET

Movie Reviews (Movie Review Polarity Dataset Enriched with "Annotator Rationales")

…Basically, "rationales" are segments of the text that support an annotator's classification. … Then the rationales would be segments of the text that support the claim (by an annotator) that the review is, indeed, positive. Here are some examples of positive rationales (the segments enclosed by double square brackets): [[you will enjoy the hell out of]] American Pie. fortunately, they [[managed to do it in an interesting

1 PAPER • NO BENCHMARKS YET

ParaCrawl

…pairs primarily aligned with English (39 out of 41) and mined using the parallel-data-crawling tool Bitextor which includes downloading documents, preprocessing and normalization, aligning documents and segments

57 PAPERS • NO BENCHMARKS YET

Tilde MODEL Corpus (Tilde Multilingual Open Data for European Languages)

…It contains over 10M segments of multilingual open data. The data has been collected from sites allowing free use and reuse of their content, as well as from public sector websites.

2 PAPERS • NO BENCHMARKS YET

Referring Expressions for DAVIS 2016 & 2017

…To validate our approach we employ two popular video object segmentation datasets, DAVIS16 [38] and DAVIS17 [42]. For the multiple object video segmentation task we consider DAVIS17. As our goal is to segment objects in videos using language specifications, we augment all objects annotated with mask labels in DAVIS16 and DAVIS17 with non-ambiguous referring expressions. (We actually quantified that only ∼15% of the collected descriptions become invalid over time, and this does not strongly affect segmentation results, as the temporal consistency step helps to disambiguate some … We believe the collected data will be of interest to the segmentation as well as vision and language communities, providing an opportunity to explore language as an alternative input for video object segmentation

75 PAPERS • 5 BENCHMARKS

CUGE

…It includes tasks like: word segmentation, part of speech tagging, reading comprehension and document retrieval.

4 PAPERS • NO BENCHMARKS YET

Multi-Modal CelebA-HQ

…Each image has a high-quality segmentation mask, sketch, descriptive text, and image with transparent background.

27 PAPERS • 3 BENCHMARKS

FSOCO

…It contains human annotated ground truth labels for both bounding boxes and instance-wise segmentation masks.

1 PAPER • NO BENCHMARKS YET

Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems

…Data are segmented into sentences which are further word tokenized.

2 PAPERS • 12 BENCHMARKS

MineralImage5k (Benchmark for 5k raw mineral species recognition)

…In addition to the samples themselves, some entries in the dataset are accompanied by supplementary natural language descriptions, size measurements, and segmentation masks.

1 PAPER • NO BENCHMARKS YET

GATITOS (Google's Additional Translations Into Tail-languages: Often Short)

…This dataset consists of 4,000 English segments (4,500 tokens) that have been translated into each of 26 low-resource languages, as well as three higher-resource pivot languages (es, fr, hi).

1 PAPER • NO BENCHMARKS YET

GUM (Georgetown University Multilayer corpus)

…Annotations include: multiple POS tags, morphological features and lemmatization; sentence segmentation and rough speech act; document structure in TEI XML (paragraphs, headings, figures, etc.)

8 PAPERS • 1 BENCHMARK

SegmentedTables

The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption.

2 PAPERS • NO BENCHMARKS YET

YTSeg

We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions.

1 PAPER • 2 BENCHMARKS

arXMLiv:08.2018

…The dataset is segmented in 3 different subsets, each corresponding to a severity level of the LaTeXML software responsible for the HTML5 conversion.

1 PAPER • NO BENCHMARKS YET

HiREST (HIerarchical REtrieval and STep-captioning)

…The dataset consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.

2 PAPERS • NO BENCHMARKS YET

GSM8K

…The dataset is segmented into 7.5K training problems and 1K test problems.

659 PAPERS • 1 BENCHMARK

MultiSum

…categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. 3) Benchmark tests performed on the proposed dataset to assess varied tasks and methods, including video temporal segmentation

1 PAPER • NO BENCHMARKS YET

MeetingBank

…The dataset contains 6,892 segment-level summarization instances for training and evaluating performance.

7 PAPERS • NO BENCHMARKS YET

RefCOCO

…In this game, the first player views an image with a segmented target object and writes a natural language expression referring to that object. These datasets serve as valuable resources for tasks like referring expression segmentation, comprehension, and visual grounding in computer vision research.

302 PAPERS • 19 BENCHMARKS

MedSecId

…The goal of this work is to segment the sections of clinical medical domain documentation.

2 PAPERS • 2 BENCHMARKS

Open Images V7

…A subset of 1.9M includes diverse annotation types: 15,851,536 boxes on 600 classes; 2,785,498 instance segmentations on 350 classes; 3,284,280 relationship annotations on 1,466 relationships; 675,155

4 PAPERS • NO BENCHMARKS YET

Jamendo Corpus

…Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs constitute a total of about 6 hours of music.

3 PAPERS • NO BENCHMARKS YET

ALFI (Annotations for Label-Free Images)

…It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community

0 PAPERS • NO BENCHMARKS YET

V3C1 (the Vimeo Creative Commons Collection 1)

The dataset comes with a shot segmentation (around 1 million shots) for which we analyze content specifics and statistics.

1 PAPER • NO BENCHMARKS YET

How2QA

…Each worker is assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors).

22 PAPERS • 2 BENCHMARKS

Persuasion Strategies

…The dataset also provides image segmentation masks, which label persuasion strategies in the corresponding ad images on the test split.

2 PAPERS • NO BENCHMARKS YET

TRIPOD (TuRnIng POint Dataset)

…The screenplay (all dialogue and description parts of the movie) segmented into scenes (selected from the Scriptbase dataset). Gold scene-level TP labels for the screenplays of the test set.

11 PAPERS • NO BENCHMARKS YET
