HMMs for Unsupervised Vietnamese Word Segmentation

16 May 2019  ·  Ba-Long Bui, Thi-Trang Nguyen, Huu-Hoang Nguyen, Kiem-Hieu Nguyen

Word segmentation is an important problem in natural language processing. Most previous work on Vietnamese word segmentation relies on supervised learning. In this paper, we propose an unsupervised method for Vietnamese word segmentation based on Hidden Markov Models. We naturally encode prior linguistic knowledge into model learning. In decoding, we propose an enhancement of the Viterbi decoding algorithm with external token ordering statistics from Pointwise Mutual Information. Evaluation on benchmark datasets shows that the proposed method works reasonably well. Source code is available at https://github.com/longbb/wordrecognition
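
As a rough illustration of the token ordering statistics mentioned in the abstract, the sketch below estimates Pointwise Mutual Information, PMI(x, y) = log(p(x, y) / (p(x) p(y))), for adjacent syllable pairs in a toy corpus. It is not taken from the authors' repository: the function name `pmi_table`, the whitespace tokenization, and the three example sentences are assumptions made for this example, and the abstract does not say how the scores enter Viterbi decoding.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Estimate PMI for every adjacent syllable pair (hypothetical helper).

    Assumes each sentence is a string of whitespace-separated syllables.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        syllables = sentence.split()
        unigrams.update(syllables)
        # Ordered adjacent pairs, so (x, y) and (y, x) are counted separately.
        bigrams.update(zip(syllables, syllables[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        table[(x, y)] = math.log(p_xy / (p_x * p_y))
    return table


# Toy corpus, invented for illustration only.
corpus = ["học sinh đi học", "học sinh học bài", "sinh viên đi học"]
pmi = pmi_table(corpus)
```

Because the pairs are ordered, `pmi[("học", "sinh")]` and `pmi[("sinh", "học")]` can differ, which is the kind of ordering signal a decoder could use to prefer one grouping of syllables over another.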
