ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation

We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) as a way to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model which continues fine-tuning them on a word segmentation task...
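
The supervised stage described above works on characters, so one common way to frame it is as per-character boundary labeling: each character is tagged 1 if it starts a word and 0 otherwise. The sketch below illustrates that framing only. It is an assumption about the task encoding, not the paper's code, and the helper names `words_to_labels` and `labels_to_words` are hypothetical.

```python
def words_to_labels(words):
    """Turn a gold segmentation into (text, labels).

    labels[i] is 1 if a word starts at character i of text, else 0.
    Characters are Python code points, so Thai vowel and tone marks
    each count as one character (an assumption of this sketch).
    """
    text = "".join(words)
    labels = []
    for word in words:
        if not word:
            continue  # skip empty tokens so labels stay aligned with text
        labels.extend([1] + [0] * (len(word) - 1))
    return text, labels


def labels_to_words(text, labels):
    """Rebuild the word list from text and per-character boundary labels."""
    words = []
    for ch, is_start in zip(text, labels):
        if is_start or not words:
            words.append(ch)   # open a new word
        else:
            words[-1] += ch    # extend the current word
    return words


if __name__ == "__main__":
    gold = ["ฉัน", "กิน", "ข้าว"]  # illustrative sentence, not from the paper
    text, labels = words_to_labels(gold)
    print(text, labels)
    print(labels_to_words(text, labels))
```

Under this framing, a segmentation model only has to predict one binary label per character, which is why character-level embedding and recurrent layers pretrained as a language model can be reused directly as its lower layers.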


Results from the Paper


Task                    Dataset    Model      Metric    Value   Global rank
Thai Word Segmentation  BEST-2010  ThaiLMCut  F1-Score  0.9878  #2
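
The listing does not say how the F1-Score is computed for this benchmark. As a rough illustration, word-level F1 is often computed by comparing the character spans of predicted words against those of gold words. The sketch below assumes that word-level definition; the function names `word_spans` and `word_f1` are hypothetical, and the benchmark's actual protocol may differ.

```python
def word_spans(words):
    """Return the set of (start, end) character offsets covered by each word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f1(gold_words, pred_words):
    """Word-level F1: a predicted word counts as correct only if both of its
    boundaries match a gold word exactly."""
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy ASCII example: only the first word is segmented correctly.
    print(word_f1(["ab", "cd", "e"], ["ab", "c", "de"]))
```

In the toy example one of three predicted words matches one of three gold words, so precision and recall are both 1/3 and the F1 is 1/3.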
