Autoencoder Paraphrase Dataset (AEPD)

Introduced by Wahle et al. in "Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection".

This is a benchmark for neural paraphrase detection: the task is to distinguish original text from machine-paraphrased content.

Training:

1,474,230 aligned paragraphs extracted from 4,012 English Wikipedia articles: 98,282 original paragraphs and 1,375,948 paraphrased versions, produced with 3 models and 5 hyperparameter configurations (98,282 paragraphs per configuration).
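The training-split counts above can be sanity-checked with a short script (a minimal sketch; the constants simply restate the published figures):

```python
# Sanity-check the published AEPD training-split paragraph counts.
original = 98_282        # original Wikipedia paragraphs
paraphrased = 1_375_948  # machine-paraphrased paragraphs
total = 1_474_230        # total aligned paragraphs reported

assert original + paraphrased == total
print(f"{total:,} aligned paragraphs = {original:,} original + {paraphrased:,} paraphrased")
```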

Testing:

BERT-large (cased):
    arXiv     - Original: 20,966; Paraphrased: 20,966
    Theses    - Original:  5,226; Paraphrased:  5,226
    Wikipedia - Original: 39,241; Paraphrased: 39,241

RoBERTa-large (cased):
    arXiv     - Original: 20,966; Paraphrased: 20,966
    Theses    - Original:  5,226; Paraphrased:  5,226
    Wikipedia - Original: 39,241; Paraphrased: 39,241

Longformer-large (uncased):
    arXiv     - Original: 20,966; Paraphrased: 20,966
    Theses    - Original:  5,226; Paraphrased:  5,226
    Wikipedia - Original: 39,241; Paraphrased: 39,241
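The per-domain test counts are identical across the three paraphrasing models; a short sketch totals them (counts taken directly from the table above):

```python
# Per-domain test counts, the same for BERT-large, RoBERTa-large, and Longformer-large.
counts = {"arXiv": 20_966, "Theses": 5_226, "Wikipedia": 39_241}

originals = sum(counts.values())  # original paragraphs per model
pairs = 2 * originals             # originals plus their paraphrased counterparts
print(f"{originals:,} original paragraphs per model; {pairs:,} test paragraphs in total")
```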
