84 dataset results for Arabic

Arabic Dataset for Commonsense Validation

A benchmark Arabic dataset for commonsense understanding and validation, accompanied by baseline research and models trained on the same dataset.

2 PAPERS • NO BENCHMARKS YET

QDAT Quran Recitation

The QDAT dataset contains 1500 WAV files along with a sound file stored in Excel CSV format. The sound file contains links to the WAV files together with other features: age, gender, and the correctness of the recitation for each of three recitation rules. The final target indicates the correctness of the whole reading.

2 PAPERS • NO BENCHMARKS YET

Semantic Question Similarity in Arabic

Semantic Question Similarity in Arabic (NSURL-2019 Shared Task 8: Semantic Question Similarity in Arabic)

NSURL-2019 Shared Task 8: Semantic Question Similarity in Arabic

2 PAPERS • NO BENCHMARKS YET

TE141K

A new text effects dataset with 141,081 text effect/glyph pairs in total. The dataset consists of 152 professionally designed text effects rendered on glyphs, including English letters, Chinese characters, and Arabic numerals.

2 PAPERS • NO BENCHMARKS YET

AMFDS (Arabic Multi-Fonts Dataset)

A multi-word, multi-font Arabic word-image dataset.

1 PAPER • NO BENCHMARKS YET

AQL-22

AQL-22 (Archive Query Log)

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.

1 PAPER • NO BENCHMARKS YET

AjwaOrMedjool

AjwaOrMedjool (AjwaOrMedjool: a binary balanced dataset to teach machine learning)

The dataset contains three subsets:

1 PAPER • NO BENCHMARKS YET

Analysing state-backed propaganda websites: a new dataset and linguistic study

This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.

1 PAPER • NO BENCHMARKS YET

ArSen-20

Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.

1 PAPER • NO BENCHMARKS YET

AraCovid19-SSD

AraCovid19-SSD is a manually annotated Arabic COVID-19 sarcasm and sentiment detection dataset containing 5,162 tweets.

1 PAPER • NO BENCHMARKS YET

BiMed1.3M

The dataset covers three types of medical interactions in both English and Arabic:
- Multiple-choice question answering (MCQA), focusing on specialized medical knowledge.
- Open question answering (QA), including real-world consumer questions.
- MCQA-grounded multi-turn chat conversations for dynamic exchanges.

1 PAPER • NO BENCHMARKS YET

CIDAR

CIDAR contains 10,000 instructions and their outputs. The dataset was created by selecting around 9,109 samples from the Alpagasus dataset and translating them into Arabic using ChatGPT. In addition, around 891 Arabic grammar instructions from the website Ask the teacher were appended. All 10,000 samples were reviewed by around 12 reviewers.

1 PAPER • NO BENCHMARKS YET

Calliar

Calliar is a dataset for Arabic calligraphy. The dataset consists of 2,500 JSON files that contain manually annotated strokes for Arabic calligraphy.

1 PAPER • NO BENCHMARKS YET

DIGITal (Digitally Generated Numerals)

The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. These image pairs include both low- and high-quality versions, with a resolution of 128x128 pixels.

1 PAPER • NO BENCHMARKS YET

DivEMT (Post-Editing Effort Across Typologically-diverse Languages)

DivEMT is the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times and pauses were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and post-editing effectiveness. Using this new dataset, we assess the impact of two state-of-the-art NMT systems, Google Translate and the multilingual mBART-50 model, on translation productivity.

1 PAPER • NO BENCHMARKS YET

Egyptian Arabic Segmentation Dataset

Contains 350 tweets with more than 8,000 words, including 3,000 unique words, written in the Egyptian dialect. The tweets are rich in dialectal content, covering most phonological, morphological, and syntactic phenomena of dialectal Egyptian Arabic. The dataset also includes Twitter-specific aspects of the text, such as #hashtags, @mentions, emoticons and URLs.

1 PAPER • NO BENCHMARKS YET

ExaASC

ExaASC is a dataset for target-based stance detection in Arabic that contains different types of targets, such as persons, entities and events. The corpus contains about 9500 tweets with replies, with the target specified in the source tweet. Each sample has at least two stance annotations provided by Exa Corporation annotators. The stance of each reply is annotated toward the target in the corresponding source tweet. The data format is as follows: id, main (source tweet), reply, target, label of each annotator id and majority_label.

1 PAPER • NO BENCHMARKS YET

HARD

HARD (Hotel Arabic-Reviews Dataset)

The Hotel Arabic-Reviews Dataset (HARD) contains 93,700 hotel reviews in Arabic. The reviews were collected from the Booking.com website during June/July 2016. They are written in Modern Standard Arabic as well as dialectal Arabic.

1 PAPER • 1 BENCHMARK

HumanEval-XL

We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 natural languages (NLs) and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

1 PAPER • NO BENCHMARKS YET

LeT-Mi (Levantine Twitter dataset for Misogynistic language)

The Levantine Twitter dataset for Misogynistic language (LeT-Mi) is an Arabic Levantine Twitter dataset for misogynistic language, intended as the first benchmark dataset for Arabic misogyny.

1 PAPER • NO BENCHMARKS YET

Mega-COV

Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes back as far as 2007), multilingual (comes in 65 languages), and has a significant number of location-tagged tweets (~32M tweets).

1 PAPER • NO BENCHMARKS YET

Mint

Mint (Multilingual Intimacy analysis)

Mint is a new Multilingual intimacy analysis dataset covering 13,384 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. The dataset is released along with the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis.

1 PAPER • NO BENCHMARKS YET

MultiTACRED

MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).

1 PAPER • NO BENCHMARKS YET

No Background RGB Arabic Alphabets Sign Language Dataset

The AASL-Clear dataset is a collection of RGB images featuring Arabic alphabet sign language gestures with backgrounds removed. Each image in this dataset shows a clear, isolated hand gesture, allowing for precise recognition and analysis of Arabic sign language alphabets. With transparent backgrounds, this dataset provides a clean and focused resource for training deep learning models in the domain of Arabic sign language recognition and classification.

1 PAPER • 1 BENCHMARK

OTEANNv3

This dataset contains orthographic samples of words in 19 languages (ar, br, de, en, eno, ent, eo, es, fi, fr, fro, it, ko, nl, pt, ru, sh, tr, zh). Each sample contains two text features: a Word (the textual representation of the word according to its orthography) and a Pronunciation (the highest-surface IPA pronunciation of the word as pronounced in its language).

1 PAPER • NO BENCHMARKS YET

RGB Arabic Alphabet Sign Language (AASL) dataset

1 PAPER • 1 BENCHMARK

RGB Arabic Alphabets Sign Language Dataset

This paper introduces the RGB Arabic Alphabet Sign Language (AASL) dataset. AASL comprises 7,856 raw and fully labeled RGB images of the Arabic sign language alphabets and is, to the best of our knowledge, the first publicly available RGB dataset of its kind. The dataset aims to help those interested in developing real-life Arabic sign language classification models. AASL was collected from more than 200 participants under different settings such as lighting, background, image orientation, image size, and image resolution. Experts in the field supervised, validated and filtered the collected images to ensure a high-quality dataset. AASL is made available to the public on Kaggle.

1 PAPER • NO BENCHMARKS YET

UTRSet-Synth

The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.

1 PAPER • NO BENCHMARKS YET

AraMeter

A dataset to identify the meters of Arabic poems.

0 PAPERS • NO BENCHMARKS YET

Arabic Handwritten Digits Dataset

Contains images of Arabic handwritten digits (60,000 training and 10,000 test images).

0 PAPERS • NO BENCHMARKS YET

Arabic Speech Corpus

The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes. The corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. It was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized using this corpus has produced a high-quality, natural voice.

0 PAPERS • NO BENCHMARKS YET

COVID-19 Disinfo (COVID-19 Disinformation Twitter Dataset)

With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that focuses on COVID-19, combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and covers Arabic, Bulgarian, Dutch, and

0 PAPERS • NO BENCHMARKS YET

ISI-PPT

This is a dataset for Arabic/English text detection and optical character recognition. All image data are text slides extracted from PowerPoint files downloaded from the Internet through the Google API. All annotations are automatically generated, mainly through the WinCom32 Python API. Postprocessing is also applied to place a more accurate text bounding box or to suppress false alarms, e.g. a text box containing only spaces. Finally, all annotation results are briefly reviewed by a human to reject extremely bad samples, e.g. a slide in which a large portion is a table copied as an image. In summary, this dataset contains 10,692 images and roughly 100K line samples.

0 PAPERS • NO BENCHMARKS YET

MNAD (Moroccan News Articles Dataset)

The MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

0 PAPERS • NO BENCHMARKS YET

RuFa

The RuFa (Ruqaa-Farsi) dataset contains images of text written in one of two Arabic fonts: Ruqaa and Nastaliq (Farsi). The dataset contains 40,000 synthesized images and 516 real ones, 40,516 in total. Images are in RGB JPG format at 100×100px. Text in the images varies in number of words, position, size, and opacity.

0 PAPERS • NO BENCHMARKS YET

Toloka WaterMeters

This dataset contains 1,244 images of hot and cold water meters, along with their readings and the coordinates of the displays showing those readings. Each image contains exactly one water meter. The archive also includes pictures of the segmentation results with the masks and collages. Toloka was used for photo capturing, segmentation, and recognizing the readings.

0 PAPERS • NO BENCHMARKS YET
