Dialect Identification
25 papers with code • 0 benchmarks • 17 datasets
Dialectal Arabic Identification
Datasets
- ArSarcasm-v2
- ArSarcasm
- FreCDo
- 997 Hours - Wuhan Dialect Speech Data by Mobile Phone
- 800 Hours - Sichuan Dialect Conversational Speech Data by Mobile Phone
- 249 Hours - Hangzhou Dialect Speech Data by Mobile Phone
- 500 Hours - Kazakh Colloquial Video Speech Data
- 505 Hours - Uyghur Colloquial Video Speech Data
- 67 Hours - Northeast Dialect Speech Data by Mobile Phone
- 794 Hours - Sichuan Dialect Speech Data by Mobile Phone
Most implemented papers
AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset
This paper releases "AraCOVID19-MFH", a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset.
Automatic Dialect Detection in Arabic Broadcast Speech
We used these features in a binary classifier to discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with an accuracy of 100%.
Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
With this research, we provide the embeddings themselves, the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task.
A Character-level Convolutional Neural Network for Distinguishing Similar Languages and Dialects
Discriminating between closely-related language varieties is considered a challenging and important task.
Speech Recognition Challenge in the Wild: Arabic MGB-3
Two hours of audio per dialect were released for development and a further two hours were used for evaluation.
CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing
We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python.
Multi-Dialect Arabic BERT for Country-Level Dialect Identification
Our winning solution itself came in the form of an ensemble of different training iterations of our pre-trained BERT model, which achieved a micro-averaged F1-score of 26.78% on the subtask at hand.
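For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a micro-averaged F1-score can be computed for single-label dialect predictions. It is not the authors' evaluation script: the function name `micro_f1` and the country labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label classification.

    True positives, false positives, and false negatives are pooled
    across all classes before precision and recall are combined.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    # F1 = 2*TP / (2*TP + FP + FN), using the pooled counts
    denom = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denom if denom else 0.0


# Hypothetical country-level labels, for illustration only
gold = ["EG", "EG", "SA", "MA", "IQ"]
pred = ["EG", "SA", "SA", "MA", "EG"]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2%}")
```

When every utterance has exactly one gold label and one predicted label, as in this sketch, the micro-averaged F1 coincides with plain accuracy, because each error counts once as a false positive and once as a false negative.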
The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification
We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared to machine learning (ML) models.
Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments
Although the prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties.
Adapting MARBERT for Improved Arabic Dialect Identification: Submission to the NADI 2021 Shared Task
Tasks are to identify the geographic origin of short Dialectal (DA) and Modern Standard Arabic (MSA) utterances at the levels of both country and province.