Glyph2Vec: Learning Chinese Out-of-Vocabulary Word Embedding from Glyphs
Chinese NLP applications that rely on large text corpora often encounter huge vocabularies in which many words appear only sparsely. We show that the written form of characters, their glyphs, can carry rich semantics in ideographic languages. We present a multi-modal model, Glyph2Vec, to tackle the Chinese out-of-vocabulary (OOV) word embedding problem. Glyph2Vec extracts visual features from word glyphs to expand an existing word embedding space with embeddings for OOV words, without requiring access to any corpus. This makes it useful for improving Chinese NLP systems, especially in low-resource scenarios. Experiments across different applications demonstrate the effectiveness of our model.
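The core idea, embedding an OOV word from the visual form of its character glyphs rather than from corpus statistics, can be illustrated with a minimal sketch. The following is not the paper's actual architecture: the bitmap size, the flatten-and-project feature extractor, and the mean pooling over characters are all simplifying assumptions standing in for a learned visual model.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 64      # dimensionality of the pretrained embedding space (assumed)
GLYPH_SIZE = 24   # glyph bitmaps assumed to be 24x24 (hypothetical)

# Hypothetical projection mapping glyph pixels into the pretrained
# embedding space; in the real model this would be learned, not random.
W = rng.normal(scale=0.01, size=(GLYPH_SIZE * GLYPH_SIZE, EMB_DIM))

def glyph_features(bitmap: np.ndarray) -> np.ndarray:
    """Flatten a binary glyph bitmap into a visual feature vector."""
    return bitmap.astype(np.float32).ravel()

def oov_embedding(char_bitmaps: list) -> np.ndarray:
    """Embed an OOV word by projecting each character's glyph features
    into the embedding space and mean-pooling over characters --
    a stand-in for the paper's multi-modal visual model."""
    projected = np.stack([glyph_features(b) @ W for b in char_bitmaps])
    return projected.mean(axis=0)

# A fake two-character word represented by two random glyph bitmaps.
word = [rng.integers(0, 2, size=(GLYPH_SIZE, GLYPH_SIZE)) for _ in range(2)]
vec = oov_embedding(word)
print(vec.shape)  # (64,)
```

The resulting vector lives in the same space as the pretrained embeddings, so it can be plugged into a downstream model without retraining; no corpus occurrences of the OOV word are needed at any point.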