Chinese Word Segmentation

48 papers with code • 6 benchmarks • 3 datasets

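Chinese word segmentation is the task of splitting Chinese text (i.e. a sequence of Chinese characters) into words. Written Chinese does not mark word boundaries with spaces, so the boundaries have to be inferred from the characters themselves (Source: www.nlpprogress.com).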
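As a concrete illustration of the task, the sketch below segments a sentence with forward maximum matching, a classic dictionary-based baseline. It is not the method of any paper listed here. The function name `forward_max_match` and the tiny vocabulary `toy_vocab` are made up for this example; real systems use large lexicons or learned models.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that appears in vocab; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"我", "喜欢", "自然", "语言", "处理", "自然语言处理"}

print(forward_max_match("我喜欢自然语言处理", toy_vocab, max_len=6))
print(forward_max_match("我喜欢自然语言处理", toy_vocab, max_len=2))
```

With `max_len=6` the greedy matcher keeps 自然语言处理 as one word, while `max_len=2` splits it into 自然 / 语言 / 处理. Both are defensible segmentations, which is one reason the task is evaluated against a fixed annotation standard.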

The Uncertainty-based Retrieval Framework for Ancient Chinese CWS and POS

Jihuai-wpy/bert-ancient-chinese LT4HALA (LREC) 2022

Automatic analysis for modern Chinese has greatly improved the accuracy of text mining in related fields, but the study of ancient Chinese is still relatively rare.

28
12 Oct 2023

LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation

tchayintr/latte-ptm-ws Journal of Natural Language Processing 2023

Our model employs a lattice structure to handle segmentation alternatives and uses graph neural networks with an attention mechanism to extract multi-granularity representations from the lattice, which complement the character representations.

2
01 Jun 2023

Ancient Chinese Word Segmentation and Part-of-Speech Tagging Using Distant Supervision

farlit/acds 3 Mar 2023

To address this problem, we take advantage of the memorization effects of deep neural networks and a small amount of annotated data to get a model with much knowledge and a little noise, and then we use this model to relabel the ancient Chinese sentences in the parallel corpus.

3
03 Mar 2023

GNN-SL: Sequence Labeling Based on Nearest Examples via GNN

shuhewang1998/gnn-sl 5 Dec 2022

To better handle long-tail cases in the sequence labeling (SL) task, in this work, we introduce graph neural networks sequence labeling (GNN-SL), which augments the vanilla SL model output with similar tagging examples retrieved from the whole training set.

5
05 Dec 2022

Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

modelscope/modelscope 27 Oct 2022

We apply BABERT for feature induction of Chinese sequence labeling tasks.

6,000
27 Oct 2022

Fast and Accurate End-to-End Span-based Semantic Role Labeling as Word-based Graph Parsing

zslin177/srl-as-gp COLING 2022

Moreover, we propose a simple constrained Viterbi procedure to ensure the legality of the output graph according to the constraints of the SRL structure.

17
06 Dec 2021

More than Text: Multi-modal Chinese Word Segmentation

manlp-suda/mcws ACL 2021

Chinese word segmentation (CWS) is undoubtedly an important basic task in natural language processing.

0
01 Aug 2021

Sub-Character Tokenization for Chinese Pretrained Language Models

thunlp/subchartokenization 1 Jun 2021

Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, and are hence robust to homophone typos.

31
01 Jun 2021
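The homophone claim in the entry above can be illustrated with a minimal sketch. It covers only the transliteration step, assumes a hand-written pinyin table (`TOY_PINYIN`, hypothetical), and is not the tokenizer from the thunlp/subchartokenization repository. The separator `#` and the function name `transliterate` are likewise choices made for this example.

```python
# Hypothetical toy table: character -> pinyin with tone number.
TOY_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "在": "zai4",
    "再": "zai4",
    "家": "jia1",
}


def transliterate(text, table=TOY_PINYIN, sep="#"):
    """Map each character to its pronunciation; unknown characters pass through."""
    return sep.join(table.get(ch, ch) for ch in text)


correct = transliterate("他在家")
typo = transliterate("她再家")  # homophone typos: 她 for 他, 再 for 在

print(correct)
print(correct == typo)
```

Because both inputs map to the same transliteration string, any subword tokenizer applied afterwards receives identical input for the correct sentence and the homophone typo.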

Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities

google-research-datasets/common-crawl-domain-names COLING 2020

Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search.

14
01 Dec 2020
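To make the openresearch example above concrete, here is a small dictionary-based sketch that uses dynamic programming to find a split with the fewest words. This is a swapped-in baseline technique, not the semi-supervised recurrent model the paper describes. The function name `segment_domain` and the vocabulary `toy_words` are assumptions for illustration.

```python
def segment_domain(name, vocab):
    """Split name into vocabulary words, preferring the fewest words.

    best[k] holds the best segmentation of name[:k], or None if that prefix
    cannot be covered by words from vocab.
    """
    n = len(name)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = name[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary, including distractors such as "pen" and "re".
toy_words = {"open", "pen", "re", "search", "research"}

print(segment_domain("openresearch", toy_words))
print(segment_domain("xyz", toy_words))
```

The fewest-words preference is what selects open + research over open + re + search. A name that cannot be covered by the vocabulary yields `None`.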

Joint Chinese Word Segmentation and Part-of-speech Tagging via Multi-channel Attention of Character N-grams

cuhksz-nlp/mcasp COLING 2020

However, their work on modeling such contextual features is limited to concatenating the features or their embeddings directly with the input embeddings without distinguishing whether the contextual features are important for the joint task in the specific context.

11
01 Dec 2020