Text Segmentation
40 papers with code • 3 benchmarks • 7 datasets
Text segmentation deals with the correct division of a document into semantically coherent blocks.
Most implemented papers
CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases
We propose a novel domain-independent framework, called CoType, that runs a data-driven text segmentation algorithm to extract entity mentions, and jointly embeds entity mentions, relation mentions, text features and type labels into two low-dimensional spaces (for entity and relation mentions respectively), where, in each space, objects whose types are close will also have similar representations.
Sequence Modeling via Segmentations
The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks.
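As a rough illustration of the product rule described above, the sketch below scores one fixed segmentation in log space. The lookup table `SEGMENT_PROBS` and the function name are hypothetical stand-ins; the abstract says each segment is modeled with tools such as recurrent neural networks, which a toy table does not reproduce.

```python
import math

# Hypothetical toy segment model: maps a segment (tuple of tokens) to a
# probability. A learned model (e.g. an RNN) would take its place in practice.
SEGMENT_PROBS = {
    ("new", "york"): 0.2,
    ("is",): 0.5,
    ("big",): 0.1,
}


def segmented_sequence_log_prob(segments, segment_probs):
    """Log-probability of a segmented sequence.

    The product of segment probabilities becomes a sum of logs,
    which avoids numerical underflow on long sequences.
    """
    return sum(math.log(segment_probs[tuple(seg)]) for seg in segments)


log_p = segmented_sequence_log_prob(
    [["new", "york"], ["is"], ["big"]], SEGMENT_PROBS
)
prob = math.exp(log_p)  # equals 0.2 * 0.5 * 0.1 up to floating-point error
```

Working in log space is a common design choice here, since multiplying many small probabilities directly tends to underflow.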
Text Segmentation as a Supervised Learning Task
Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding.
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, for a total of 22400+ hours.
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference.
Text Segmentation based on Semantic Word Embeddings
We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation.
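To make the idea of embedding-based segmentation concrete, here is a deliberately simplified sketch. It is not C99 and not the paper's algorithms: it averages word vectors per sentence and places a boundary wherever the cosine similarity between adjacent sentences falls below a threshold. The toy embeddings, the function names, and the threshold value are all assumptions for illustration.

```python
import math


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_boundaries(sentences, embeddings, threshold=0.5):
    """Return indices i such that a new segment starts at sentence i."""
    vecs = [mean_vector(s, embeddings) for s in sentences]
    return [
        i + 1
        for i in range(len(vecs) - 1)
        if cosine(vecs[i], vecs[i + 1]) < threshold
    ]


# Toy 2-dimensional embeddings (hypothetical values, not from a trained model).
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "bond": [0.1, 0.9],
}
toy_sentences = [["cat", "dog"], ["dog"], ["stock"], ["bond", "stock"]]
boundaries = find_boundaries(toy_sentences, toy_embeddings)
```

With these toy vectors, the only low-similarity gap is between the second and third sentences, so a single boundary is placed there.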
Khmer Word Segmentation Using Conditional Random Fields
The trained CRF segmenter was compared empirically to a baseline approach based on maximum matching that used a dictionary extracted from the manually segmented corpus.
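For readers unfamiliar with the maximum matching baseline mentioned above, the following is a minimal sketch of the forward greedy variant, assuming a plain set of dictionary words and a single-character fallback for unknown text. The function name, the fallback rule, and the Latin-script example are illustrative assumptions, not the paper's exact setup.

```python
def max_match(text, dictionary, max_len=None):
    """Forward maximum matching word segmentation.

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds the search window (at least 1).
        max_len = max(1, max((len(w) for w in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


words = {"the", "cat", "cats", "sat", "at"}
segmented = max_match("thecatsat", words)
```

On this input the greedy choice picks "cats" over "cat", leaving "at" rather than "sat", which shows why maximum matching serves as a baseline rather than a strong segmenter.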
An efficient way for segmentation of Bangla characters in printed document using curved scanning
The main cause of poor Optical Character Recognition (OCR) output for Bangla text is segmentation-related error.