Chinese word segmentation (CWS) is a fundamental step of Chinese natural language processing.
Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.
We present a simple yet elegant solution to train a single joint model on multi-criteria corpora for Chinese Word Segmentation (CWS).
However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found.
However, existing methods for Chinese NER either do not exploit word boundary information from CWS or cannot filter the specific information of CWS.
Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation.
Most previous approaches to Chinese word segmentation formalize the problem as a character-based sequence labeling task, in which only contextual information within fixed-size local windows and simple interactions between adjacent tags can be captured.
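As an illustration of the character-based sequence labeling formulation, the sketch below converts a segmented sentence into one tag per character and back. It assumes the common B/M/E/S tag set (begin, middle, end, single), which the excerpt above does not name; the function names `words_to_tags` and `tags_to_words` and the example sentence are hypothetical and not taken from any of the listed papers.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map segmented words to per-character tags (assumed B/M/E/S scheme)."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an incomplete tag sequence
        words.append(current)
    return words


words = ["我", "喜欢", "自然语言"]
tags = words_to_tags(words)
restored = tags_to_words("".join(words), tags)
```

Under this formulation a segmenter only has to predict the tag sequence; the word boundaries follow deterministically from the tags, which is why the task reduces to sequence labeling.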
The previous lattice LSTM model takes word embeddings as the lexicon input; we prove that subword encoding can give comparable performance and has the benefit of not relying on any external segmentor.
In recent years, since neural-network-based methods were proposed, accuracy on the Chinese word segmentation task has improved greatly.
As far as we know, we are the first to propose a neural model for unsupervised CWS and to achieve performance competitive with the state-of-the-art statistical models on four different datasets from the SIGHAN 2005 bakeoff.