12 dataset results for Domain Adaptation AND Texts AND English

ASPEC (Asian Scientific Paper Excerpt Corpus)

ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.

84 PAPERS • NO BENCHMARKS YET

MTNT

The Machine Translation of Noisy Text (MTNT) dataset is a machine translation dataset that consists of noisy comments from Reddit and professionally sourced translations. The translations are between English and French and between English and Japanese, with between 7k and 37k sentences per language pair.

51 PAPERS • NO BENCHMARKS YET

Amazon Product Data

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

33 PAPERS • 6 BENCHMARKS

Multilingual Reuters (Multilingual Reuters Collection)

The Multilingual Reuters Collection dataset comprises over 11,000 articles from six classes in five languages, i.e., English (E), French (F), German (G), Italian (I), and Spanish (S).

13 PAPERS • 1 BENCHMARK

CrossNER

CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully labeled collection of NER data spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for each domain. CrossNER also includes unlabeled domain-related corpora for the same five domains.

11 PAPERS • 1 BENCHMARK

PLABA (Plain Language Adaptation of Biomedical Abstracts)

Plain Language Adaptation of Biomedical Abstracts (PLABA) is a dataset designed for automatic adaptation that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs.

4 PAPERS • NO BENCHMARKS YET

Stanceosaurus

Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.

3 PAPERS • NO BENCHMARKS YET

Youtubean

Youtubean is a dataset created from closed captions of YouTube product review videos. It can be used for aspect extraction and sentiment classification.

3 PAPERS • NO BENCHMARKS YET

Illness-dataset (Illness multi-domain textual dataset)

A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.

2 PAPERS • NO BENCHMARKS YET

MSDA (Multi-source domain adaptation dataset for text recognition)

The dataset covers five domains (synthetic, document, street view, handwritten, and car license) and contains over five million images.

2 PAPERS • 2 BENCHMARKS

XL-R2R (Cross-lingual Room-to-Room)

The XL-R2R dataset is built upon the R2R dataset and extends it with Chinese instructions. XL-R2R preserves the same splits as in R2R and thus consists of train, val-seen, and val-unseen splits with both English and Chinese instructions, and test split with English instructions only.

2 PAPERS • NO BENCHMARKS YET

WMT 2014 Medical (WMT 2014 Medical Translation Task)

The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.

1 PAPER • NO BENCHMARKS YET
