XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
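A Winograd schema item pairs an ambiguous pronoun with two candidate antecedents, and a model is scored on whether it picks the right one. A minimal sketch of how such an item might be represented and evaluated (the field names here are illustrative, not XWINO's actual format):

```python
# Illustrative structure of a Winograd schema item (field names are
# hypothetical, not XWINO's actual data format).
item = {
    "sentence": "The trophy didn't fit in the suitcase because it was too big.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "answer": 0,  # index of the correct antecedent
}

def accuracy(predictions, items):
    """Fraction of items where the predicted antecedent index is correct."""
    correct = sum(p == it["answer"] for p, it in zip(predictions, items))
    return correct / len(items)

print(accuracy([0], [item]))  # a correct guess scores 1.0
```

Evaluation is the same in every language: the schema pair is constructed so that surface statistics alone cannot resolve the pronoun, which is what makes it a commonsense-reasoning probe.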
3 PAPERS • 1 BENCHMARK
A comprehensive multi-task benchmark for Polish language understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks adapted from existing datasets for named entity recognition, question answering, textual entailment, and others.
2 PAPERS • NO BENCHMARKS YET
A benchmark Arabic dataset for commonsense understanding and validation, along with baseline research and models trained on the same dataset.
ChatLog is a coarse-to-fine temporal dataset consisting of two parts, one updated monthly and one updated daily.
GeoGLUE is a GeoGraphic Language Understanding Evaluation benchmark consisting of six geographic text-related tasks, including geographic textual similarity on recall, geographic elements tagging, geographic composition analysis, geographic where-what cut, and geographic entity alignment. All task datasets are collected from openly released resources.
NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.
RuMedBench is a benchmark dataset for Russian medical language understanding.
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850-hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speakers' post-conversation reflections.
1 PAPER • NO BENCHMARKS YET
CLUES (Constrained Language Understanding Evaluation Standard) is a benchmark for evaluating the few-shot learning capabilities of NLU models.
The Dialog-based Language Learning dataset is designed to measure how well models can learn as a student given a teacher's textual responses to the student's answers (as well as potentially receiving an external real-valued reward signal).
EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera, and Swine Flu, and another subset of 4.7 million tweets on six global epidemic outbreaks, including the 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera, and 2018 Kivu Ebola.
ExPUNations is a humor dataset with extensive and fine-grained annotations specifically for puns. It is designed for two new tasks: explanation generation to aid pun classification, and keyword-conditioned pun generation.
The FewGLUE_64_labeled dataset is a new version of the FewGLUE dataset. It contains a 64-sample training set, a development set (the original SuperGLUE development set), a test set, and an unlabeled set. It is constructed to facilitate research on few-shot learning for natural language understanding tasks.
IDK-MRC is an Indonesian Machine Reading Comprehension (MRC) dataset consisting of more than 10K questions in total, including over 5K unanswerable questions of diverse types.
ImagiFilter focuses on photographic and/or natural images, a very common use case in computer vision research. Annotations are provided for a coarse prediction task (photographic vs. non-photographic) and for a smaller fine-grained prediction task in which the non-photographic class is broken down into five classes: maps, drawings, graphs, icons, and sketches.
MultiWOZ-coref (or MultiWOZ 2.3) is an extension of the MultiWOZ dataset that adds co-reference annotations in addition to corrections of dialogue acts and dialogue states.
This project is a collection of three corpora which can be used for evaluating chatbots or other conversational interfaces. Two of the corpora were extracted from StackExchange, one from a Telegram chatbot.
Phrase in Context is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets were annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.
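The Phrase Similarity task asks whether two phrases are semantically equivalent in their contexts; a common baseline scores a candidate pair by cosine similarity of phrase embeddings and thresholds the score. A minimal sketch with toy vectors (the embeddings and threshold below are made up for illustration, not PiC's baseline):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy phrase embeddings -- in practice these would come from an encoder
# run over each phrase in its sentence context.
emb_a = np.array([0.9, 0.1, 0.3])
emb_b = np.array([0.8, 0.2, 0.4])

# Predict "similar" when the score exceeds a threshold tuned on dev data.
THRESHOLD = 0.9
print(cosine(emb_a, emb_b) > THRESHOLD)  # prints True for these toy vectors
```

The harder PSD task cannot be solved this way with context-free embeddings, since the same phrase must be mapped to different senses depending on its surrounding sentence.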
We release various types of word embeddings for multiple Indian languages. Please note that for the majority of our work, the corpora were transliterated to the Devanagari script, so the script differs from the original. We provide word embedding models using FastText and ELMo, as well as cross-lingual models based on an orthogonal alignment of monolingual models for all pairs of these languages.
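The orthogonal alignment of two monolingual embedding spaces is typically solved as an orthogonal Procrustes problem: given paired vectors, the best rotation is recovered in closed form from an SVD. A self-contained sketch with random toy matrices (the dimensions and data are illustrative, not the released models):

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F for row-paired X, Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))                  # "source language" vectors
R, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # hidden true rotation
Y = X @ R                                          # "target language" vectors

W = orthogonal_procrustes(X, Y)
print(np.allclose(X @ W, Y))  # prints True: the rotation is recovered exactly
```

Because the mapping is constrained to be orthogonal, distances and dot products within each monolingual space are preserved, which is why this alignment style is popular for building cross-lingual embeddings from a small bilingual seed lexicon.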
This dataset consists of challenging reasoning questions in multiple-choice format.