SMC Text Corpus

Contents (as of March 4, 2019)

The text corpus contains running text from various freely licensed sources.
- The whole content of Malayalam Wikipedia, extracted on January 1, 2019
- News and articles from various sources, with the source mentioned in the respective files:
  - 251 MB
  - 8,60,159 lines
  - 98,15,533 words
  - 10,11,11,885 characters
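Line, word, and character counts like those listed for the text corpus can be reproduced with a short script. The following is a minimal sketch, not the project's own tooling: it assumes words are whitespace-separated tokens and characters are Unicode code points excluding the trailing newline, which may differ from how the published figures were computed. The names `corpus_stats` and `format_indian` are hypothetical; `format_indian` mirrors the Indian digit grouping (lakh, crore) used in the figures.

```python
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines, whitespace-separated words, and characters.

    Assumption: characters are Unicode code points, and the trailing
    newline of each line is not counted.
    """
    n_lines = n_words = n_chars = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_words += len(text.split())
        n_chars += len(text)
    return {"lines": n_lines, "words": n_words, "characters": n_chars}


def format_indian(n: int) -> str:
    """Format a non-negative integer with Indian digit grouping."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # After the last three digits, group the rest in pairs from the right.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    sample = ["ab cd\n", "efg\n"]
    stats = corpus_stats(sample)
    for name, value in stats.items():
        print(f"{name}: {format_indian(value)}")
```

To run this over a real file, pass an open file handle, for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` points to one of the corpus text files.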

The word corpus contains:
- Classified lexicon prepared for the Malayalam Morphology Analyser project
- Unique words extracted from Malayalam Wikipedia, Wiktionary, etc.
- 14,27,392 words
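As an illustration of how a unique word list might be derived from running text, here is a minimal sketch. It is an assumption, not the method used to build the published word corpus: it treats a word as a run of code points from the Malayalam Unicode block (U+0D00 to U+0D7F), optionally joined by zero-width joiner or non-joiner characters, and the function name `unique_malayalam_words` is hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: a "word" is a run of code points in the Malayalam block,
# with ZWJ (U+200D) or ZWNJ (U+200C) allowed only between such runs.
MALAYALAM_WORD = re.compile(
    r"[\u0D00-\u0D7F]+(?:[\u200C\u200D][\u0D00-\u0D7F]+)*"
)


def unique_malayalam_words(lines: Iterable[str]) -> List[str]:
    """Return the sorted set of Malayalam-script tokens found in lines."""
    seen = set()
    for line in lines:
        seen.update(MALAYALAM_WORD.findall(line))
    return sorted(seen)


if __name__ == "__main__":
    sample = ["കേരളം kerala കേരളം", "മലയാളം, കേരളം"]
    words = unique_malayalam_words(sample)
    print(len(words), "unique words")
```

Latin-script tokens and punctuation are dropped by the pattern; a real pipeline would likely also need Unicode normalization and handling of digits and mixed-script tokens.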

Tasks

Language Modelling

Similar Datasets

IndicTTS
