🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language

17 dataset results for Domain Adaptation AND Texts

ASPEC (Asian Scientific Paper Excerpt Corpus)

ASPEC, Asian Scientific Paper Excerpt Corpus, is constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). It consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project which was run in Japan from 2006 to 2010.

84 PAPERS • NO BENCHMARKS YET

MTNT

The Machine Translation of Noisy Text (MTNT) dataset is a Machine Translation dataset that consists of noisy comments on Reddit and professionally sourced translation. The translation are between French, Japanese and French, with between 7k and 37k sentence per language pair.

51 PAPERS • NO BENCHMARKS YET

Amazon Product Data

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

33 PAPERS • 6 BENCHMARKS

KdConv (Knowledge-driven Conversation)

KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics, while the corpus can also used for exploration of transfer learning and domain adaptation.

21 PAPERS • NO BENCHMARKS YET

TechQA (The TechQA Dataset)

TECHQA is a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size – 600 training, 310 dev, and 490 evaluation question/answer pairs – thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TECHQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote—a technical document that addresses a specific technical issue.

17 PAPERS • NO BENCHMARKS YET

JESC (Japanese-English Subtitle Corpus)

Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.

16 PAPERS • NO BENCHMARKS YET

CMU DoG (CMU Document Grounded Conversations Dataset)

This is a document grounded dataset for text conversations. "Document Grounded Conversations" are conversations that are about the contents of a specified document. In this dataset the specified documents are Wikipedia articles about popular movies. The dataset contains 4112 conversations with an average of 21.43 turns per conversation.

14 PAPERS • NO BENCHMARKS YET

Multilingual Reuters (Multilingual Reuters Collection)

The Multilingual Reuters Collection dataset comprises over 11,000 articles from six classes in five languages, i.e., English (E), French (F), German (G), Italian (I), and Spanish (S).

13 PAPERS • 1 BENCHMARK

CrossNER

CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

11 PAPERS • 1 BENCHMARK

NLPeer

NLPeer is a multidomain corpus of more than 5k papers and 11k review reports from five different venues. In addition to the new datasets of paper drafts, camera-ready versions and peer reviews from the NLP community, this dataset has a unified data representation, and augment previous peer review datasets to include parsed, structured paper representations, rich metadata and versioning information.

4 PAPERS • NO BENCHMARKS YET

PLABA (Plain Language Adaptation of Biomedical Abstracts)

Plain Language Adaptation of Biomedical Abstracts (PLABA) is a dataset designed for automatic adaptation that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs.

4 PAPERS • NO BENCHMARKS YET

Stanceosaurus

Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.

3 PAPERS • NO BENCHMARKS YET

Youtubean

Youtbean is a dataset created from closed captions of YouTube product review videos. It can be used for aspect extraction and sentiment classification.

3 PAPERS • NO BENCHMARKS YET

Illness-dataset

Illness-dataset (Illness multi-domain textual dataset)

A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans across four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.

2 PAPERS • NO BENCHMARKS YET

MSDA (Multi-source domain adaptation dataset for text recognition)

5 domains: synthetic domain, document domain, street view domain, handwritten domain, and car license domain over five million images

2 PAPERS • 2 BENCHMARKS

XL-R2R

XL-R2R (Cross-lingual Room-to-Room)

The XL-R2R dataset is built upon the R2R dataset and extends it with Chinese instructions. XL-R2R preserves the same splits as in R2R and thus consists of train, val-seen, and val-unseen splits with both English and Chinese instructions, and test split with English instructions only.

2 PAPERS • NO BENCHMARKS YET

WMT 2014 Medical

WMT 2014 Medical (WMT 2014 Medical Translation Task)

The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.

1 PAPER • NO BENCHMARKS YET

Datasets

17 dataset results for Domain Adaptation AND Texts