Chinese Word Segmentation
46 papers with code • 6 benchmarks • 3 datasets
Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
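For readers unfamiliar with the task, here is a minimal example using the open-source jieba toolkit (a general-purpose segmenter, not one of the papers listed below); any trained segmenter could stand in for it.

```python
# Minimal illustration of Chinese word segmentation with jieba
# (https://github.com/fxsjy/jieba); the sentence is the example
# from the jieba README.
import jieba

text = "我来到北京清华大学"          # "I came to Tsinghua University in Beijing"
words = list(jieba.cut(text))       # jieba.cut returns a generator
print(words)                        # ['我', '来到', '北京', '清华大学']
```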
Most implemented papers
ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations
Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.
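As a rough illustration of the n-gram enhancement idea, the hypothetical sketch below shows the matching step such models build on: collecting, for each character position, the lexicon n-grams that cover it. The lexicon and function are illustrative stand-ins, not ZEN's actual code.

```python
# Hypothetical sketch of n-gram matching for an n-gram-enhanced encoder:
# map each character position to the lexicon n-grams covering it.
from typing import Dict, List

def match_ngrams(chars: str, lexicon: set, max_n: int = 4) -> Dict[int, List[str]]:
    """Map each character index to the lexicon n-grams that cover it."""
    covering: Dict[int, List[str]] = {i: [] for i in range(len(chars))}
    for i in range(len(chars)):
        for n in range(2, max_n + 1):
            gram = chars[i:i + n]
            if len(gram) == n and gram in lexicon:
                for j in range(i, i + n):
                    covering[j].append(gram)
    return covering

lexicon = {"北京", "清华", "大学", "清华大学"}
print(match_ngrams("北京清华大学", lexicon))
```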
Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks
Experiments on WMT14 translation tasks demonstrate that ATR-based neural machine translation can yield competitive performance on English-German and English-French language pairs in terms of both translation quality and speed.
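A minimal NumPy rendering of the ATR recurrence as described in the paper, assuming the twin gates are computed from the sum and difference of the same two projections; dimensions and initialization are illustrative, not the authors' implementation.

```python
# Sketch of the addition-subtraction twin-gated recurrence (ATR):
# input/forget gates are "twins" derived from the same projections.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def atr_step(x_t, h_prev, W, U):
    p = W @ x_t          # projected input
    q = U @ h_prev       # projected history
    i = sigmoid(p + q)   # input gate: addition
    f = sigmoid(p - q)   # forget gate: subtraction (the "twin")
    return i * p + f * h_prev

rng = np.random.default_rng(0)
d = 8
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for x_t in rng.normal(size=(5, d)):  # run over a toy sequence
    h = atr_step(x_t, h, W, U)
print(h.shape)                        # (8,)
```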
PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation
Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain.
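The toolkit is pip-installable; below is a minimal usage sketch following the project README as I recall it (pretrained models are downloaded on first use, and the model_name argument selects a domain-specific model).

```python
# Minimal pkuseg usage per the project README.
import pkuseg

seg = pkuseg.pkuseg()                 # default mixed-domain model
print(seg.cut("我爱北京天安门"))        # e.g. ['我', '爱', '北京', '天安门']

med_seg = pkuseg.pkuseg(model_name="medicine")  # domain-specific model
print(med_seg.cut("糖尿病患者的血糖控制"))
```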
Segmental Recurrent Neural Networks
Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these "segment embeddings" are used to define compatibility scores with output labels.
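A hedged PyTorch sketch of that segment-embedding idea: encode the tokens of a contiguous span with a bidirectional RNN and use the final states as the span's representation. Sizes and architecture are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)          # toy vocabulary
birnn = nn.LSTM(16, 8, bidirectional=True, batch_first=True)

def segment_embedding(token_ids, i, j):
    """Embed the segment token_ids[i:j] (j exclusive) with a BiLSTM."""
    span = emb(token_ids[i:j]).unsqueeze(0)   # (1, j-i, 16)
    _, (h, _) = birnn(span)                   # h: (2, 1, 8) fwd/bwd finals
    return torch.cat([h[0, 0], h[1, 0]])      # (16,) segment embedding

tokens = torch.randint(0, 100, (10,))
print(segment_embedding(tokens, 2, 5).shape)  # torch.Size([16])
```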
LSICC: A Large Scale Informal Chinese Corpus
Deep-learning-based natural language processing models have proven powerful, but they need large-scale datasets.
Glyce: Glyph-vectors for Chinese Character Representations
However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.
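A loose sketch of the underlying glyph-vector idea, assuming a locally available CJK font (the font path below is a placeholder); Glyce's actual design, a Tianzige-CNN over multiple historical scripts, is considerably richer.

```python
# Render a character to a small grayscale bitmap, encode it with a CNN.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "/path/to/a/cjk/font.ttf"   # placeholder: any local CJK font

def render_glyph(ch: str, size: int = 24) -> torch.Tensor:
    img = Image.new("L", (size, size), color=0)
    font = ImageFont.truetype(FONT_PATH, size)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=font)
    arr = np.array(img, dtype=np.float32) / 255.0
    return torch.from_numpy(arr).view(1, 1, size, size)

glyph_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (1, 8) glyph vector
)
print(glyph_cnn(render_glyph("学")).shape)    # torch.Size([1, 8])
```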
Investigating Self-Attention Network for Chinese Word Segmentation
Neural networks have become the dominant method for Chinese word segmentation.
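A generic sketch of the approach, not the paper's exact architecture: a Transformer encoder over characters followed by a B/M/E/S boundary-tag classifier (hyperparameters are illustrative).

```python
import torch
import torch.nn as nn

class SANSegmenter(nn.Module):
    def __init__(self, vocab_size=5000, d_model=64, n_tags=4):  # B/M/E/S
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tagger = nn.Linear(d_model, n_tags)

    def forward(self, char_ids):              # (batch, seq_len)
        h = self.encoder(self.emb(char_ids))  # contextual char states
        return self.tagger(h)                 # (batch, seq_len, 4) scores

model = SANSegmenter()
print(model(torch.randint(0, 5000, (2, 7))).shape)  # torch.Size([2, 7, 4])
```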
Sub-Character Tokenization for Chinese Pretrained Language Models
Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos.
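The homophone claim is easy to check with a pinyin romanizer such as pypinyin (a convenient proxy here, not the paper's own tokenizer): homophones map to the same transliteration, so a pronunciation-based tokenizer sees identical input.

```python
from pypinyin import lazy_pinyin

print(lazy_pinyin("他在公司"))   # ['ta', 'zai', 'gong', 'si']
print(lazy_pinyin("他再公司"))   # homophone typo 再 for 在 -> same 'zai'
```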
Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
We apply BABERT for feature induction of Chinese sequence labeling tasks.
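As I understand the approach, the boundary awareness comes from unsupervised statistics over raw text; the sketch below illustrates one such statistic, pointwise mutual information between adjacent characters, on a toy corpus. It shows the signal only, not BABERT's pretraining objective.

```python
# Low PMI across a character pair hints at a word boundary.
import math
from collections import Counter

corpus = ["北京大学在北京", "清华大学也在北京"]   # toy raw text
chars = Counter(c for s in corpus for c in s)
pairs = Counter(s[i:i + 2] for s in corpus for i in range(len(s) - 1))
n_c, n_p = sum(chars.values()), sum(pairs.values())

def pmi(a: str, b: str) -> float:
    return math.log((pairs[a + b] / n_p) / ((chars[a] / n_c) * (chars[b] / n_c)))

print(pmi("大", "学"))   # higher: strongly associated, likely word-internal
print(pmi("学", "在"))   # lower: a plausible word boundary
```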
Exploring Segment Representations for Neural Segmentation Models
Many natural language processing (NLP) tasks can be formulated as segmentation problems.
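To make that generalization concrete, the sketch below shows the decoding problem shared by neural segmentation models: a semi-Markov dynamic program that selects the best split of the input under a per-segment scoring function. The scorer here is a toy stand-in for a learned segment model.

```python
from typing import Callable, List, Tuple

def best_segmentation(x: str, score: Callable[[str], float],
                      max_len: int = 4) -> Tuple[float, List[str]]:
    n = len(x)
    best: List[Tuple[float, List[str]]] = [(-float("inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for j in range(1, n + 1):                    # end of last segment
        for i in range(max(0, j - max_len), j):  # start of last segment
            s = best[i][0] + score(x[i:j])
            if s > best[j][0]:
                best[j] = (s, best[i][1] + [x[i:j]])
    return best[n]

lexicon = {"北京", "清华大学", "在"}
score = lambda seg: 1.0 if seg in lexicon else -0.5
print(best_segmentation("清华大学在北京", score))  # (3.0, ['清华大学', '在', '北京'])
```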