SMC Text Corpus

Contents (As on March 4, 2019)

The text corpus contains running text from various freely licensed sources.

- The whole content of Malayalam Wikipedia, extracted on January 1, 2019
- News and articles from various sources, with the source mentioned in the respective files:
  - 251 MB
  - 8,60,159 lines
  - 98,15,533 words
  - 10,11,11,885 characters
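The line, word, and character counts above are the kind of figures a simple streaming pass over the text can produce. The following is a minimal sketch, not the project's own tooling: the function name `corpus_stats` is hypothetical, words are assumed to be whitespace-separated tokens, and characters are counted as Unicode code points including line breaks. The source does not say how the published figures were computed, so results from this sketch may differ from them.

```python
import io


def corpus_stats(stream):
    """Return (lines, words, characters) for an iterable of text lines.

    Assumptions (not taken from the corpus documentation):
    - a word is any whitespace-separated token
    - a character is one Unicode code point, newlines included
    """
    lines = words = chars = 0
    for line in stream:
        lines += 1
        words += len(line.split())
        chars += len(line)
    return lines, words, chars


if __name__ == "__main__":
    # Tiny in-memory sample; a real run would pass an open UTF-8 file instead.
    sample = io.StringIO("മലയാളം ഒരു ഭാഷയാണ്\nഇത് ഒരു വരി\n")
    n_lines, n_words, n_chars = corpus_stats(sample)
    print(n_lines, n_words, n_chars)
```

To run it on a corpus file, open the file with `open(path, encoding="utf-8")` and pass the file object to `corpus_stats`. Reading line by line keeps memory use small even for a file of a few hundred megabytes.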

The word corpus contains:

- Classified lexicon prepared for the Malayalam Morphology Analyser project
- Unique words extracted from Malayalam Wikipedia, Wiktionary, etc.
- 14,27,392 words
