We train these models primarily on out-of-domain data and employ simple domain-adaptation techniques based on the characteristics of the in-domain dataset.
The monolingual Hindi BERT models currently available on the model hub do not perform better than multilingual models on downstream tasks.
We evaluate these models on real text classification datasets to show that embeddings learned from synthetic data generalize to real datasets as well, and thus represent an effective training strategy for low-resource languages.
Pre-training large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks.
This step helps cover the target-domain vocabulary and improves model performance on the downstream task.
In this work, we deeply explore a wide range of challenges in automatic hate speech detection by presenting a hierarchical organization of these problems.
These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture.
The parallel data in the target domain is then used to fine-tune the final dense layer of generic ASR models.
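Fine-tuning only the final dense layer of a pre-trained model can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the `TinyASRModel` class, its layer names, and all dimensions are assumptions standing in for a real generic ASR model, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a generic pre-trained ASR model: an encoder followed
# by a final dense (classification) layer. Names and sizes are illustrative.
class TinyASRModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, vocab=32):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.final_dense = nn.Linear(hidden, vocab)

    def forward(self, x):
        out, _ = self.encoder(x)
        return self.final_dense(out)

model = TinyASRModel()

# Freeze every parameter except those of the final dense layer, so that
# in-domain fine-tuning updates only that layer.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("final_dense")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]

# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Freezing the encoder keeps the generic acoustic representations intact while the small in-domain parallel corpus adapts only the output layer, which helps avoid overfitting when target-domain data is scarce.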
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We present L3Cube-HingCorpus, the first large-scale, real Hindi-English code-mixed dataset in Roman script.
Named Entity Recognition (NER) is a fundamental NLP task with major applications in conversational and search systems.
In this work, we present L3Cube-MahaHate, the first major Hate Speech Dataset in Marathi.
In this work, we consider NER for low-resource Indian languages like Hindi and Marathi.
Along with the hierarchical approaches, this work also provides a comparison of different deep learning algorithms like USE, BERT, HAN, Longformer, and BigBird for long document classification.
In this work, we carry out a data-focused study evaluating the impact of systematic practical perturbations on the performance of deep-learning-based text classification models such as CNN-, LSTM-, and BERT-based algorithms.
We reiterate that long document classification is a simpler task, and even basic algorithms perform competitively with BERT-based approaches on most of the datasets.
The basic CNN- and LSTM-based models are augmented with fastText word embeddings.
While encryption is the best way to ensure image security, full encryption and decryption is a computationally intensive process.
We also compare our image-based hierarchical neural network model with a simple image-based CNN architecture and with text-based CNN and LSTM models to highlight its novelty and efficiency.
We show that these systems are over-reliant on the important words present in the text that are useful for classification.
The Marathi language is one of the prominent languages used in India.
The pre-trained Hindi fastText word embeddings from IndicNLP and Facebook are used in conjunction with CNN and LSTM models.
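Using pre-trained fastText vectors with an LSTM classifier typically means initializing an embedding layer from the vector table. The sketch below is a minimal PyTorch illustration under stated assumptions: the tiny vocabulary, the random vectors standing in for real pre-trained fastText embeddings, and the `LSTMClassifier` class are all hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

# Fake stand-ins for pre-trained fastText vectors; in practice these would
# be loaded from the IndicNLP or Facebook fastText releases.
dim = 300
vocab = ["<pad>", "<unk>", "word_a", "word_b"]
pretrained = {w: np.random.randn(dim).astype("float32") for w in vocab[2:]}

# Build the embedding matrix row-by-row; unknown words stay zero-initialized.
matrix = np.zeros((len(vocab), dim), dtype="float32")
for i, w in enumerate(vocab):
    if w in pretrained:
        matrix[i] = pretrained[w]

class LSTMClassifier(nn.Module):
    def __init__(self, weights, hidden=64, num_classes=2, freeze=True):
        super().__init__()
        # freeze=True keeps the pre-trained embeddings fixed during training.
        self.emb = nn.Embedding.from_pretrained(
            torch.from_numpy(weights), freeze=freeze, padding_idx=0
        )
        self.lstm = nn.LSTM(weights.shape[1], hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, ids):
        x = self.emb(ids)            # (batch, seq, dim)
        _, (h, _) = self.lstm(x)     # h: (1, batch, hidden)
        return self.fc(h[-1])        # (batch, num_classes)

model = LSTMClassifier(matrix)
logits = model(torch.tensor([[2, 3, 0]]))  # one padded example
```

Whether to freeze or further fine-tune the embedding layer is a design choice; freezing is common when the downstream dataset is small, since fastText's subword-based vectors already cover morphologically rich languages like Hindi well.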
These platforms have led to an increase in the creation and spread of fake news.
The shared task aims to build a translation system for Indian languages in specific domains like Artificial Intelligence (AI) and Chemistry using a small in-domain parallel corpus.
We evaluate different deep learning models and input representation combinations for this task.
The use of deep learning has revolutionized text-processing techniques and achieved remarkable results.