Automated Generation of Multilingual Clusters for the Evaluation of Distributed Representations

4 Nov 2016 · Philip Blair, Yuval Merhav, Joel Barry

We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of "outlier" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.
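To make the outlier detection task concrete, the following is a minimal sketch of one common way to score it: each word in a group is ranked by its mean cosine similarity to the other words, and the least similar word is predicted to be the outlier. The function names, the toy two-dimensional vectors, and the scoring rule itself are assumptions chosen for illustration; the abstract does not specify the exact scoring used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def detect_outlier(words, embeddings):
    """Return the word with the lowest mean similarity to the rest of the group.

    Hypothetical helper: `embeddings` maps each word to its vector.
    """
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(embeddings[word], embeddings[o]) for o in others)
        return total / len(others)

    return min(words, key=mean_similarity)


# Toy vectors invented for this example; real embeddings have far more dimensions.
toy_embeddings = {
    "apple": [0.9, 0.1],
    "pear": [0.8, 0.2],
    "plum": [0.85, 0.15],
    "hammer": [0.1, 0.9],
}

group = ["apple", "pear", "plum", "hammer"]
predicted = detect_outlier(group, toy_embeddings)
print(predicted)
```

An embedding does well on this task when the word it singles out matches the labelled outlier across many such groups.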

Datasets


Introduced in the Paper:

WikiSem500
