Chinese Word Segmentation

48 papers with code • 6 benchmarks • 3 datasets

Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
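Segmentation is commonly cast as per-character tagging. The sketch below is illustrative and not taken from any paper listed here. It assumes the widely used BMES scheme (Begin, Middle, End, Single) and shows that a word sequence and its tag sequence carry the same information. The function names are hypothetical.

```python
def words_to_bmes(words):
    """Convert a list of words into one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Recover words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ["我", "喜欢", "自然语言"]
tags = words_to_bmes(words)
restored = bmes_to_words("".join(words), tags)
```

Under this view, a segmenter only needs to predict one tag per character, which is why many of the models below are framed as sequence labelers.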

Most implemented papers

ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

sinovation/ZEN Findings of the Association for Computational Linguistics 2020

Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.

PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

lancopku/pkuseg-python 27 Jun 2019

Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain.

Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks

bzhangGo/zero EMNLP 2018

Experiments on WMT14 translation tasks demonstrate that ATR-based neural machine translation can yield competitive performance on English-German and English-French language pairs in terms of both translation quality and speed.

Segmental Recurrent Neural Networks

ykrmm/TREMBA 18 Nov 2015

Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these "segment embeddings" are used to define compatibility scores with output labels.

LSICC: A Large Scale Informal Chinese Corpus

JaniceZhao/Douban-Dushu-Dataset 26 Nov 2018

Deep learning-based natural language processing models have proven powerful, but they need large-scale datasets.

Glyce: Glyph-vectors for Chinese Character Representations

ShannonAI/glyce NeurIPS 2019

However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.

Investigating Self-Attention Network for Chinese Word Segmentation

gump88/SAN-CWS 26 Jul 2019

Neural networks have become the dominant method for Chinese word segmentation.

Sub-Character Tokenization for Chinese Pretrained Language Models

thunlp/subchartokenization 1 Jun 2021

Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos.
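The homophone property can be illustrated with a toy sketch. It is not the SubChar implementation: the four-entry `TOY_PINYIN` table and the `transliterate` function are hypothetical, and toneless pinyin is assumed for simplicity. The characters 在 and 再 are homophones, so a typo that swaps them leaves the transliterated sequence unchanged.

```python
# Hypothetical toy table; a real system would cover the full character set.
TOY_PINYIN = {"我": "wo", "在": "zai", "再": "zai", "家": "jia"}


def transliterate(text):
    """Map each character to its (assumed, toneless) pinyin syllable."""
    return [TOY_PINYIN[ch] for ch in text]


correct = transliterate("我在家")  # intended sentence
typo = transliterate("我再家")     # homophone typo: 再 written for 在
same_encoding = correct == typo
```

Because both inputs map to the same sequence, any tokenization step applied after transliteration sees identical input for the correct text and the typo.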

Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

modelscope/AdaSeq 27 Oct 2022

We apply BABERT for feature induction of Chinese sequence labeling tasks.

LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation

tchayintr/latte-ptm-ws Journal of Natural Language Processing 2023

Our model employs the lattice structure to handle segmentation alternatives and utilizes graph neural networks along with an attention mechanism to attentively extract multi-granularity representation from the lattice for complementing character representations.
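To make "segmentation alternatives" concrete, the following minimal sketch enumerates the candidate words of a lattice over a character sequence. It is an assumption-laden illustration rather than the LATTE code: the lexicon, the `build_lattice` name, and the `max_len` cap are hypothetical, and the graph neural network and attention components are not shown.

```python
def build_lattice(text, lexicon, max_len=4):
    """Return (start, end, word) edges for every single character and
    every lexicon word found in text, with end exclusive."""
    edges = []
    for start in range(len(text)):
        last_end = min(start + max_len, len(text))
        for end in range(start + 1, last_end + 1):
            piece = text[start:end]
            # Single characters are always kept so a full path exists.
            if end - start == 1 or piece in lexicon:
                edges.append((start, end, piece))
    return edges


# Hypothetical lexicon; "研究生命" is ambiguous between
# 研究 / 生命 and 研究生 / 命.
lexicon = {"研究", "研究生", "生命"}
lattice = build_lattice("研究生命", lexicon)
```

Each path through these edges from position 0 to the end of the text is one possible segmentation. A lattice-based model scores or encodes all of them jointly instead of committing to one up front.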