Chinese Word Segmentation
50 papers with code • 6 benchmarks • 3 datasets
Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
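As a concrete illustration, here is a minimal sketch using the open-source jieba segmenter (chosen for illustration only; it is not one of the papers listed below):

```python
# Minimal illustration of the task with jieba (pip install jieba); any
# CWS system exposes a similar string-in, word-list-out interface.
import jieba

text = "我来到北京清华大学"       # unsegmented character sequence
words = list(jieba.cut(text))    # typically ['我', '来到', '北京', '清华大学']
print(" / ".join(words))
```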
Benchmarks
These leaderboards are used to track progress in Chinese Word Segmentation
Most implemented papers
ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations
It is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.
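The n-gram enhancement rests on matching each character against a pre-extracted n-gram lexicon. The sketch below shows only that matching step, with a toy lexicon standing in for ZEN's actual n-gram table:

```python
# Hedged sketch of the n-gram matching step ZEN-style encoders rely on:
# for each character position, collect the lexicon n-grams covering it.
def match_ngrams(chars, lexicon, max_n=4):
    covering = [[] for _ in chars]
    for i in range(len(chars)):
        for n in range(2, max_n + 1):
            gram = "".join(chars[i:i + n])
            if gram in lexicon:
                for j in range(i, i + n):
                    covering[j].append(gram)  # gram informs position j
    return covering

chars = list("提高人民生活水平")
lexicon = {"提高", "人民", "生活", "水平", "生活水平"}
for ch, grams in zip(chars, match_ngrams(chars, lexicon)):
    print(ch, grams)
```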
PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation
Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain.
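pkuseg ships as a Python package; per the project README, domain-specific pretrained models are selected via `model_name`. A usage sketch (segmentation output in comments is illustrative):

```python
import pkuseg

seg = pkuseg.pkuseg()                # default mixed-domain model
print(seg.cut("我爱北京天安门"))      # typically ['我', '爱', '北京', '天安门']

# Swap in a domain-specific pretrained model (downloaded on first use):
med_seg = pkuseg.pkuseg(model_name="medicine")
```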
Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks
Experiments on WMT14 translation tasks demonstrate that ATR-based neural machine translation can yield competitive performance on English-German and English-French language pairs in terms of both translation quality and speed.
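The ATR cell derives its two gates from the sum and difference of a single pair of projections, which is what cuts the parameter count relative to a GRU. A NumPy sketch of one recurrence step, with illustrative shapes and initialization:

```python
# Sketch of one ATR step as described in the paper: the input and forget
# gates are the sigmoid of the sum and difference of the same two
# projections of the input and the previous hidden state.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def atr_step(x_t, h_prev, W, U):
    p = W @ x_t         # projected input
    q = U @ h_prev      # projected history
    i = sigmoid(p + q)  # input gate  (addition)
    f = sigmoid(p - q)  # forget gate (subtraction)
    return i * p + f * h_prev

d = 8
rng = np.random.default_rng(0)
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for x_t in rng.normal(size=(5, d)):  # toy sequence of five steps
    h = atr_step(x_t, h, W, U)
```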
Segmental Recurrent Neural Networks
Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these "segment embeddings" are used to define compatibility scores with output labels.
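A toy PyTorch sketch of that idea: encode one candidate span with a BiLSTM and map the resulting segment embedding to label scores. The paper's dynamic program over all segmentations is omitted here; enumerating spans naively would be quadratic in sequence length.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    def __init__(self, emb_dim=32, hid=32, n_labels=4, vocab=5000):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.birnn = nn.LSTM(emb_dim, hid, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid, n_labels)

    def score_span(self, token_ids, i, j):
        # encode the contiguous subsequence tokens[i:j]
        span = self.emb(token_ids[i:j]).unsqueeze(0)      # (1, j-i, emb)
        h, _ = self.birnn(span)
        hid = h.size(-1) // 2
        seg_emb = torch.cat([h[0, -1, :hid],              # forward final state
                             h[0, 0, hid:]])              # backward final state
        return self.out(seg_emb)                          # label compatibility scores

tokens = torch.randint(0, 5000, (10,))
scores = SegmentScorer().score_span(tokens, 2, 5)
```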
LSICC: A Large Scale Informal Chinese Corpus
Deep learning based natural language processing models have proven powerful, but they need large-scale datasets.
Glyce: Glyph-vectors for Chinese Character Representations
However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.
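The raw ingredient such models start from is a rendered glyph image. A hedged sketch using Pillow; the font path is a placeholder, as any CJK-capable TrueType font on your system will do:

```python
# Sketch of turning a character's glyph into image features, the kind of
# raw input Glyce-style vision models build on.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "/path/to/a/cjk/font.ttf"   # placeholder; system dependent

def render_glyph(char, size=24):
    font = ImageFont.truetype(FONT_PATH, size)
    img = Image.new("L", (size, size), color=0)           # grayscale canvas
    ImageDraw.Draw(img).text((0, 0), char, fill=255, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0      # CNN-ready array

glyph = render_glyph("中")   # (24, 24) array encoding the glyph's shape
```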
Investigating Self-Attention Network for Chinese Word Segmentation
Neural networks have become the dominant method for Chinese word segmentation.
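A generic sketch of such a model: a Transformer encoder over character embeddings with a per-character BMES tagging head. The hyperparameters below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SANSegmenter(nn.Module):
    def __init__(self, vocab=5000, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.tagger = nn.Linear(d_model, 4)         # B, M, E, S tags

    def forward(self, char_ids):                    # (batch, seq_len)
        pos_ids = torch.arange(char_ids.size(1), device=char_ids.device)
        h = self.encoder(self.emb(char_ids) + self.pos(pos_ids))
        return self.tagger(h)                       # per-character tag scores

logits = SANSegmenter()(torch.randint(0, 5000, (1, 12)))  # (1, 12, 4)
```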
Sub-Character Tokenization for Chinese Pretrained Language Models
Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos.
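This homophone-collapsing behavior can be illustrated with the pypinyin library (used here for illustration only; it is not the paper's tokenizer):

```python
# Homophonous characters collapse to the same transliteration, so
# homophone typos yield identical token streams downstream.
from pypinyin import lazy_pinyin

print(lazy_pinyin("他在做什么"))   # e.g. ['ta', 'zai', 'zuo', 'shen', 'me']
print(lazy_pinyin("她在作什么"))   # homophone variants, same pinyin output
```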
Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
We apply BABERT for feature induction of Chinese sequence labeling tasks.
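As a generic illustration of unsupervised boundary statistics (not the paper's exact pretraining objective), low pointwise mutual information between adjacent characters in raw text hints at a word boundary:

```python
import math
from collections import Counter

corpus = "人民生活水平提高生活水平不断提高水平很高不好不坏"  # toy raw text
uni = Counter(corpus)
bi = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
N = len(corpus)

def pmi(a, b):
    return math.log((bi[a + b] / (N - 1)) / ((uni[a] / N) * (uni[b] / N)))

print(pmi("水", "平"))   # higher: 水平 recurs as a unit (inside a word)
print(pmi("平", "不"))   # lower: 平 and 不 co-occur only incidentally
```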
LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation
Our model employs the lattice structure to handle segmentation alternatives and utilizes graph neural networks along with an attention mechanism to attentively extract multi-granularity representation from the lattice for complementing character representations.
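The lattice itself can be built by adding an edge for every dictionary-matched span, with single-character edges guaranteeing connectivity; the dictionary below is a toy stand-in for the resources the model would use:

```python
# Sketch of building a word lattice over a character sequence: every
# dictionary match becomes a lattice edge over its span.
def build_lattice(chars, dictionary, max_len=4):
    edges = []                                    # (start, end, word)
    for i in range(len(chars)):
        edges.append((i, i + 1, chars[i]))        # character-level edge
        for n in range(2, max_len + 1):
            word = "".join(chars[i:i + n])
            if word in dictionary:
                edges.append((i, i + n, word))    # word-level edge
    return edges

chars = list("生活水平")
print(build_lattice(chars, {"生活", "水平", "生活水平"}))
# [(0,1,'生'), (0,2,'生活'), (0,4,'生活水平'), (1,2,'活'),
#  (2,3,'水'), (2,4,'水平'), (3,4,'平')]
```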