Text Segmentation

40 papers with code • 3 benchmarks • 7 datasets

Text segmentation deals with the correct division of a document into semantically coherent blocks.

Most implemented papers

CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases

shanzhenren/CoType 27 Oct 2016

We propose a novel domain-independent framework, called CoType, that runs a data-driven text segmentation algorithm to extract entity mentions, and jointly embeds entity mentions, relation mentions, text features and type labels into two low-dimensional spaces (for entity and relation mentions respectively), where, in each space, objects whose types are close will also have similar representations.

Sequence Modeling via Segmentations

posenhuang/NPMT ICML 2017

The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks.
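As a rough illustration of the product rule described above, the sketch below computes the probability of one fixed segmentation by summing per-segment log-probabilities. The names `segmentation_log_prob` and `toy_segment_log_prob` are hypothetical, and the toy scorer stands in for the recurrent network mentioned in the excerpt; it is not the paper's model.

```python
import math
from typing import Callable, Sequence, Tuple

Segment = Tuple[str, ...]


def segmentation_log_prob(
    segments: Sequence[Segment],
    segment_log_prob: Callable[[Segment], float],
) -> float:
    """Log-probability of a segmented sequence.

    The product of segment probabilities becomes a sum in log space,
    which avoids numerical underflow on long sequences.
    """
    return sum(segment_log_prob(seg) for seg in segments)


def toy_segment_log_prob(segment: Segment) -> float:
    """Hypothetical stand-in for a learned segment model (e.g. an RNN).

    Assumes every token, plus one end-of-segment symbol, has probability 0.5.
    """
    return (len(segment) + 1) * math.log(0.5)


# One possible segmentation of the sequence (a, b, c) into two segments.
segments = [("a", "b"), ("c",)]
log_p = segmentation_log_prob(segments, toy_segment_log_prob)
prob = math.exp(log_p)  # 0.5 ** 5 for this toy scorer
```

Any callable that returns a log-probability for a segment could replace the toy scorer here.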

Text Segmentation as a Supervised Learning Task

koomri/text-segmentation NAACL 2018

Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding.

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

wenet-e2e/wenetspeech 7 Oct 2021

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total.

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

yinyueqin/denserewardrlhf-ppo 6 Jan 2025

Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference.

Text Segmentation based on Semantic Word Embeddings

chschock/textsplit 18 Mar 2015

We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation.
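To make the idea of embedding-based segmentation concrete, here is a minimal generic sketch. It places a single boundary where the mean embeddings of the two sides are least similar. It is not the C99 algorithm, and it is not claimed to be the method from the paper or the `textsplit` repository. The function names and the two-dimensional toy vectors are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_boundary(sentence_vectors: Sequence[Vector]) -> Optional[int]:
    """Index i such that splitting into [:i] and [i:] minimizes the cosine
    similarity between the mean vectors of the two parts.

    Returns None when there are fewer than two vectors.
    """
    best_i, best_sim = None, float("inf")
    for i in range(1, len(sentence_vectors)):
        sim = cosine(mean_vector(sentence_vectors[:i]),
                     mean_vector(sentence_vectors[i:]))
        if sim < best_sim:
            best_i, best_sim = i, sim
    return best_i


# Hypothetical sentence embeddings: two sentences on one topic, then two on another.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
boundary = best_boundary(toy_vectors)
```

In practice the vectors would come from a trained word-embedding model, and a full segmenter would place several boundaries rather than one.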

Khmer Word Segmentation Using Conditional Random Fields

VietHoang1512/khmer-nltk 15 Oct 2015

The trained CRF segmenter was compared empirically to a baseline approach based on maximum matching that used a dictionary extracted from the manually segmented corpus.
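The maximum-matching baseline mentioned above can be sketched as a greedy forward scan that always takes the longest dictionary word starting at the current position. This is a generic illustration, not the paper's exact baseline. The function name `max_match`, the single-character fallback, and the English toy dictionary are assumptions.

```python
from typing import List, Optional, Set


def max_match(text: str, dictionary: Set[str],
              max_len: Optional[int] = None) -> List[str]:
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that matches;
    if none matches, emit a single character and move on.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try candidate lengths from longest to shortest (length >= 2).
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Toy English example, chosen to show how greedy matching can go wrong.
toy_dictionary = {"the", "table", "down", "there", "theta", "bled", "own"}
tokens = max_match("thetabledownthere", toy_dictionary)
# tokens == ["theta", "bled", "own", "there"]
```

The toy output shows the weakness of this approach: the greedy choice of "theta" blocks the intended reading "the table down there". A sequence model such as a CRF can avoid this kind of error because it scores boundaries in context.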

An efficient way for segmentation of Bangla characters in printed document using curved scanning

Fazle-Rabby-Sourav/Bangla-Optical-Character-Recognition-System 13 May 2016

The main cause of poor Optical Character Recognition (OCR) output for Bangla text is segmentation-related error.