Chinese Word Segmentation
48 papers with code • 6 benchmarks • 3 datasets
Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words (Source: www.nlpprogress.com).
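CWS is most commonly cast as per-character sequence labeling with BMES tags (B = word begin, M = middle, E = end, S = single-character word). A minimal sketch of decoding such tags back into words (the tag sequence here is illustrative):

```python
def decode_bmes(chars, tags):
    """Recover words from per-character BMES tags.

    B/M/E mark the begin/middle/end of a multi-character word;
    S marks a single-character word.
    """
    words, buf = [], []
    for ch, tag in zip(chars, tags):
        buf.append(ch)
        if tag in ("E", "S"):        # a word ends at this character
            words.append("".join(buf))
            buf = []
    if buf:                          # tolerate a truncated tag sequence
        words.append("".join(buf))
    return words

# "我爱自然语言处理" -> 我 / 爱 / 自然 / 语言 / 处理
print(decode_bmes(list("我爱自然语言处理"),
                  ["S", "S", "B", "E", "B", "E", "B", "E"]))
```

The modeling work in the papers below is about predicting these tags; the decoding step itself stays this simple.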
Benchmarks
These leaderboards are used to track progress in Chinese Word Segmentation
Latest papers
The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS
Automatic analysis for modern Chinese has greatly improved the accuracy of text mining in related fields, but the study of ancient Chinese is still relatively rare.
LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation
Our model employs the lattice structure to handle segmentation alternatives and uses graph neural networks with an attention mechanism to extract multi-granularity representations from the lattice, complementing the character representations.
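The lattice idea can be illustrated with a minimal, library-free sketch (toy two-dimensional embeddings and plain dot-product attention, not the paper's GNN): each character attends over the candidate lattice words that contain it, and the weighted word summary is concatenated with the character vector.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy lattice for "南京市": the character 京 occurs in the candidate
# words 南京 and 京市, so it attends over both word embeddings.
char_vec = [1.0, 0.0]                  # toy embedding of 京
word_vecs = [[0.9, 0.1], [0.2, 0.8]]   # toy embeddings of 南京 and 京市
augmented = char_vec + attend(char_vec, word_vecs, word_vecs)
```

The actual model replaces the single dot-product step with message passing over the whole lattice graph; this sketch only shows why a character's representation can absorb multi-granularity word information.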
Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Distant Supervision
To address this problem, we exploit the memorization effects of deep neural networks together with a small amount of annotated data to obtain a model that captures much of the target knowledge with little noise, and then use this model to relabel the ancient Chinese sentences in the parallel corpus.
GNN-SL: Sequence Labeling Based on Nearest Examples via GNN
To better handle long-tail cases in the sequence labeling (SL) task, in this work we introduce graph neural network sequence labeling (GNN-SL), which augments the vanilla SL model output with similar tagging examples retrieved from the whole training set.
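The retrieval-augmentation idea can be sketched without the GNN machinery: interpolate the base model's tag distribution with a distribution induced from the retrieved nearest examples. The function name, distances, and mixing weight below are all illustrative, not the paper's.

```python
import math

def knn_augmented_tags(model_probs, neighbors, lam=0.5, temp=1.0):
    """Interpolate the model's tag distribution with one induced
    from retrieved training examples.

    model_probs: {tag: prob} from the base sequence labeler.
    neighbors:   list of (distance, tag) pairs retrieved from the training set.
    lam:         weight on the base model (1 - lam on the retrieved evidence).
    """
    # Turn neighbor distances into a soft tag distribution (closer -> heavier).
    weights = [math.exp(-d / temp) for d, _ in neighbors]
    total = sum(weights)
    knn_probs = {tag: 0.0 for tag in model_probs}
    for w, (_, tag) in zip(weights, neighbors):
        knn_probs[tag] += w / total
    # Linear interpolation of the two distributions.
    return {t: lam * model_probs[t] + (1 - lam) * knn_probs[t]
            for t in model_probs}

# A long-tail case: the base model leans toward "S", but close
# retrieved neighbors mostly carry the tag "B".
mixed = knn_augmented_tags({"B": 0.4, "S": 0.6},
                           [(0.1, "B"), (0.2, "B"), (0.9, "S")])
```

GNN-SL's contribution is aggregating the retrieved examples with a graph neural network rather than this flat interpolation, but the end effect is the same kind of correction on rare cases.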
Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
We apply BABERT for feature induction of Chinese sequence labeling tasks.
Fast and Accurate End-to-End Span-based Semantic Role Labeling as Word-based Graph Parsing
Moreover, we propose a simple constrained Viterbi procedure to ensure the legality of the output graph according to the constraints of the SRL structure.
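Constrained Viterbi decoding can be sketched with simple BIO tags: transitions that would produce an illegal structure (e.g. an "I" with no preceding "B") are masked out, so the decoder is forced onto a legal path even when the emission scores prefer an illegal one. The tag set, scores, and constraint set below are illustrative; the paper's SRL-specific constraints are richer.

```python
def constrained_viterbi(emissions, tags, illegal):
    """Viterbi decoding that skips transitions violating structural constraints.

    emissions: list of {tag: score} dicts, one per position.
    illegal:   set of forbidden (prev_tag, cur_tag) pairs;
               "<s>" stands for the sequence start.
    """
    NEG = float("-inf")
    # Each cell holds (best score, backpointer to previous tag).
    table = [{t: ((emissions[0][t] if ("<s>", t) not in illegal else NEG), None)
              for t in tags}]
    for em in emissions[1:]:
        row = {}
        for t in tags:
            cands = [(table[-1][p][0] + em[t], p)
                     for p in tags
                     if (p, t) not in illegal and table[-1][p][0] > NEG]
            row[t] = max(cands) if cands else (NEG, None)
        table.append(row)
    # Backtrack from the best final tag.
    tag = max(table[-1], key=lambda t: table[-1][t][0])
    path = [tag]
    for i in range(len(table) - 1, 0, -1):
        tag = table[i][tag][1]
        path.append(tag)
    return list(reversed(path))

tags = ["B", "I", "O"]
illegal = {("O", "I"), ("<s>", "I")}           # a span may not start with I
emissions = [{"B": 0.2, "I": 0.7, "O": 0.1},   # greedy pick would be illegal "I"
             {"B": 0.1, "I": 0.8, "O": 0.1},
             {"B": 0.1, "I": 0.2, "O": 0.7}]
print(constrained_viterbi(emissions, tags, illegal))
```

Here greedy decoding would emit the illegal I-I-O, while the constrained decoder returns B-I-O.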
More than Text: Multi-modal Chinese Word Segmentation
Chinese word segmentation (CWS) is undoubtedly an important basic task in natural language processing.
Sub-Character Tokenization for Chinese Pretrained Language Models
Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, making them robust to homophone typos.
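The homophone-robustness property can be shown with a toy character-to-pinyin table (a real system maps the full character inventory and then learns subword units over the transliterations, a step omitted here): two strings that differ only by a homophone typo tokenize identically.

```python
# Toy pinyin-with-tone table; real tokenizers cover the full character set.
PINYIN = {"做": "zuo4", "作": "zuo4", "业": "ye4", "再": "zai4", "在": "zai4"}

def subchar_tokenize(text):
    """Map each character to its pinyin so homophones share one token sequence."""
    return [PINYIN.get(ch, ch) for ch in text]

# "作业" (homework) and its homophone typo "做业" tokenize identically.
print(subchar_tokenize("作业") == subchar_tokenize("做业"))  # True
```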
Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search.
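A dictionary-only baseline for this splitting is the classic word-break dynamic program (the vocabulary here is a stand-in for the knowledge-graph entities the paper actually uses): find the segmentation of the string into known words, preferring the fewest words.

```python
def segment_domain(name, vocab):
    """Split a concatenated domain name into vocabulary words
    via word-break dynamic programming (fewest-words preference)."""
    n = len(name)
    best = [None] * (n + 1)   # best[i] = segmentation of name[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in vocab:
                cand = best[j] + [name[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

vocab = {"open", "research", "re", "search"}
print(segment_domain("openresearch", vocab))  # ['open', 'research']
```

The fewest-words preference rules out the spurious open / re / search split; the paper's contribution is handling names whose parts are *not* in any fixed vocabulary, which this baseline cannot do.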
Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams
However, prior work models such contextual features only by concatenating the features (or their embeddings) directly with the input embeddings, without distinguishing whether a given contextual feature actually matters for the joint task in the specific context.