Chinese Word Segmentation
48 papers with code • 6 benchmarks • 3 datasets
Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
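CWS is most commonly cast as per-character sequence labeling with BMES tags (B = word begin, M = middle, E = end, S = single-character word). A minimal sketch of decoding such tags back into words (the tag sequence here is illustrative):

```python
def decode_bmes(chars, tags):
    """Recover words from per-character BMES tags.

    B/M/E mark the begin/middle/end of a multi-character word;
    S marks a single-character word.
    """
    words, buf = [], []
    for ch, tag in zip(chars, tags):
        buf.append(ch)
        if tag in ("E", "S"):        # a word ends at this character
            words.append("".join(buf))
            buf = []
    if buf:                          # tolerate a truncated tag sequence
        words.append("".join(buf))
    return words

# "我爱自然语言处理" -> 我 / 爱 / 自然 / 语言 / 处理
print(decode_bmes(list("我爱自然语言处理"),
                  ["S", "S", "B", "E", "B", "E", "B", "E"]))
```

The modeling work in the papers below is about predicting these tags; the decoding step itself stays this simple.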
Benchmarks
These leaderboards are used to track progress in Chinese Word Segmentation
Latest papers
The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS
Automatic analysis for modern Chinese has greatly improved the accuracy of text mining in related fields, but the study of ancient Chinese is still relatively rare.
LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation
Our model employs the lattice structure to handle segmentation alternatives and uses graph neural networks with an attention mechanism to extract multi-granularity representations from the lattice, complementing the character representations.
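The lattice idea can be illustrated with a minimal, library-free sketch (toy two-dimensional embeddings and plain dot-product attention, not the paper's GNN): each character attends over the candidate lattice words that contain it, and the weighted word summary is concatenated with the character vector.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy lattice for "南京市": the character 京 occurs in the candidate
# words 南京 and 京市, so it attends over both word embeddings.
char_vec = [1.0, 0.0]                  # toy embedding of 京
word_vecs = [[0.9, 0.1], [0.2, 0.8]]   # toy embeddings of 南京 and 京市
augmented = char_vec + attend(char_vec, word_vecs, word_vecs)
```

The actual model replaces the single dot-product step with message passing over the whole lattice graph; this sketch only shows why a character's representation can absorb multi-granularity word information.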
Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Distant Supervision
To address this problem, we exploit the memorization effects of deep neural networks together with a small amount of annotated data to obtain a model that captures much of the target knowledge with little noise, and then use this model to relabel the ancient Chinese sentences in the parallel corpus.
GNN-SL: Sequence Labeling Based on Nearest Examples via GNN
To better handle long-tail cases in the sequence labeling (SL) task, in this work we introduce graph neural network sequence labeling (GNN-SL), which augments the vanilla SL model output with similar tagging examples retrieved from the whole training set.
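The retrieval-augmentation idea can be sketched without the GNN machinery: interpolate the base model's tag distribution with a distribution induced from the retrieved nearest examples. The function name, distances, and mixing weight below are all illustrative, not the paper's.

```python
import math

def knn_augmented_tags(model_probs, neighbors, lam=0.5, temp=1.0):
    """Interpolate the model's tag distribution with one induced
    from retrieved training examples.

    model_probs: {tag: prob} from the base sequence labeler.
    neighbors:   list of (distance, tag) pairs retrieved from the training set.
    lam:         weight on the base model (1 - lam on the retrieved evidence).
    """
    # Turn neighbor distances into a soft tag distribution (closer -> heavier).
    weights = [math.exp(-d / temp) for d, _ in neighbors]
    total = sum(weights)
    knn_probs = {tag: 0.0 for tag in model_probs}
    for w, (_, tag) in zip(weights, neighbors):
        knn_probs[tag] += w / total
    # Linear interpolation of the two distributions.
    return {t: lam * model_probs[t] + (1 - lam) * knn_probs[t]
            for t in model_probs}

# A long-tail case: the base model leans toward "S", but close
# retrieved neighbors mostly carry the tag "B".
mixed = knn_augmented_tags({"B": 0.4, "S": 0.6},
                           [(0.1, "B"), (0.2, "B"), (0.9, "S")])
```

GNN-SL's contribution is aggregating the retrieved examples with a graph neural network rather than this flat interpolation, but the end effect is the same kind of correction on rare cases.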
Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
We apply BABERT for feature induction of Chinese sequence labeling tasks.
Fast and Accurate End-to-End Span-based Semantic Role Labeling as Word-based Graph Parsing
Moreover, we propose a simple constrained Viterbi procedure to ensure the legality of the output graph according to the constraints of the SRL structure.
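Constrained Viterbi decoding can be sketched with simple BIO tags: transitions that would produce an illegal structure (e.g. an "I" with no preceding "B") are masked out, so the decoder is forced onto a legal path even when the emission scores prefer an illegal one. The tag set, scores, and constraint set below are illustrative; the paper's SRL-specific constraints are richer.

```python
def constrained_viterbi(emissions, tags, illegal):
    """Viterbi decoding that skips transitions violating structural constraints.

    emissions: list of {tag: score} dicts, one per position.
    illegal:   set of forbidden (prev_tag, cur_tag) pairs;
               "<s>" stands for the sequence start.
    """
    NEG = float("-inf")
    # Each cell holds (best score, backpointer to previous tag).
    table = [{t: ((emissions[0][t] if ("<s>", t) not in illegal else NEG), None)
              for t in tags}]
    for em in emissions[1:]:
        row = {}
        for t in tags:
            cands = [(table[-1][p][0] + em[t], p)
                     for p in tags
                     if (p, t) not in illegal and table[-1][p][0] > NEG]
            row[t] = max(cands) if cands else (NEG, None)
        table.append(row)
    # Backtrack from the best final tag.
    tag = max(table[-1], key=lambda t: table[-1][t][0])
    path = [tag]
    for i in range(len(table) - 1, 0, -1):
        tag = table[i][tag][1]
        path.append(tag)
    return list(reversed(path))

tags = ["B", "I", "O"]
illegal = {("O", "I"), ("<s>", "I")}           # a span may not start with I
emissions = [{"B": 0.2, "I": 0.7, "O": 0.1},   # greedy pick would be illegal "I"
             {"B": 0.1, "I": 0.8, "O": 0.1},
             {"B": 0.1, "I": 0.2, "O": 0.7}]
print(constrained_viterbi(emissions, tags, illegal))
```

Here greedy decoding would emit the illegal I-I-O, while the constrained decoder returns B-I-O.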
More than Text: Multi-modal Chinese Word Segmentation
Chinese word segmentation (CWS) is undoubtedly an important basic task in natural language processing.
Sub-Character Tokenization for Chinese Pretrained Language Models
Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos.
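The homophone-robustness property can be shown with a toy character-to-pinyin table (a real system maps the full character inventory and then learns subword units over the transliterations, a step omitted here): two strings that differ only by a homophone typo tokenize identically.

```python
# Toy pinyin-with-tone table; real tokenizers cover the full character set.
PINYIN = {"做": "zuo4", "作": "zuo4", "业": "ye4", "再": "zai4", "在": "zai4"}

def subchar_tokenize(text):
    """Map each character to its pinyin so homophones share one token sequence."""
    return [PINYIN.get(ch, ch) for ch in text]

# "作业" (homework) and its homophone typo "做业" tokenize identically.
print(subchar_tokenize("作业") == subchar_tokenize("做业"))  # True
```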
Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search.
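A dictionary-only baseline for this splitting is the classic word-break dynamic program (the vocabulary here is a stand-in for the knowledge-graph entities the paper actually uses): find the segmentation of the string into known words, preferring the fewest words.

```python
def segment_domain(name, vocab):
    """Split a concatenated domain name into vocabulary words
    via word-break dynamic programming (fewest-words preference)."""
    n = len(name)
    best = [None] * (n + 1)   # best[i] = segmentation of name[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in vocab:
                cand = best[j] + [name[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"open", "research", "re", "search"}
print(segment_domain("openresearch", vocab))  # ['open', 'research']
```

The fewest-words preference rules out the spurious open / re / search split; the paper's contribution is handling names whose parts are *not* in any fixed vocabulary, which this baseline cannot do.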
Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams
However, prior work models such contextual features only by concatenating the features (or their embeddings) directly with the input embeddings, without distinguishing whether a given contextual feature actually matters for the joint task in the specific context.