LSHTC is a dataset for large-scale text classification. The data used in the LSHTC challenges originates from two popular sources: the DBpedia and the ODP (Open Directory Project) directory, also known as DMOZ. DBpedia instances were selected from the english, non-regional Extended Abstracts provided by the DBpedia site. The DMOZ instances consist of either Content vectors, Description vectors or both. A Content vectors is obtained by directly indexing the web page using standard indexing chain (preprocessing, stemming/lemmatization, stop-word removal).
18 PAPERS • NO BENCHMARKS YET