ChatGPT Paraphrases Dataset | Papers With Code

This is a dataset of paraphrases created by ChatGPT.

**We used this prompt to generate paraphrases:**                         
Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: *{text}*
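As a rough illustration of how this prompt could be used, the sketch below fills the `{text}` placeholder and splits a numbered-list reply into separate paraphrases. The helper names `build_prompt` and `parse_numbered_list` are hypothetical, and the parsing rule (a leading number followed by `.` or `)`) is an assumption about the reply format, not the authors' actual pipeline. No model call is made here.

```python
import re
from typing import List

# Prompt text as quoted in the dataset description; {text} is the input question.
PROMPT_TEMPLATE = (
    "Generate 5 similar paraphrases for this question, "
    "show it like a numbered list without commentaries: {text}"
)


def build_prompt(text: str) -> str:
    """Insert the source text into the prompt template."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_numbered_list(reply: str) -> List[str]:
    """Extract items from lines such as '1. ...' or '2) ...' (assumed reply format)."""
    items = []
    for line in reply.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            items.append(match.group(1))
    return items


if __name__ == "__main__":
    print(build_prompt("How do I learn Python?"))
    # A made-up reply, used only to exercise the parser.
    example_reply = "1. What is a good way to learn Python?\n2. How can I study Python?"
    print(parse_numbered_list(example_reply))
```

Lines that do not start with a number are skipped, so stray commentary in a reply would be dropped rather than stored as a paraphrase.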

This dataset is based on questions from the [Quora Question Pairs dataset](https://www.kaggle.com/competitions/quora-question-pairs) and texts from the [SQuAD 2.0](https://huggingface.co/datasets/squad_v2) and [CNN news](https://huggingface.co/datasets/cnn_dailymail) datasets.

We generated 5 paraphrases for each sample; in total the dataset has about 350k rows. Each row holds 6 texts (the original plus its 5 paraphrases), so you can build 6 x 5 = 30 ordered pairs from a single row. Across the whole dataset this gives about 10.5 million training pairs (6 x 5 x 350,000 = 10.5 million directed pairs, or 6 x 5 x 350,000 / 2 = 5.25 million unique pairs).
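The pair counts above can be checked with a short sketch. It treats the original text and its 5 paraphrases as 6 interchangeable variants and enumerates ordered and unordered pairs with `itertools`. The function names and the placeholder strings are illustrative assumptions, not part of the dataset.

```python
from itertools import combinations, permutations
from typing import List, Tuple


def ordered_pairs(text: str, paraphrases: List[str]) -> List[Tuple[str, str]]:
    """All (source, target) pairs where direction matters: 6 * 5 = 30 per row."""
    variants = [text, *paraphrases]
    return list(permutations(variants, 2))


def unique_pairs(text: str, paraphrases: List[str]) -> List[Tuple[str, str]]:
    """All pairs ignoring direction: 6 * 5 / 2 = 15 per row."""
    variants = [text, *paraphrases]
    return list(combinations(variants, 2))


if __name__ == "__main__":
    # Placeholder row: one original text and five paraphrases.
    text = "original"
    paraphrases = ["p1", "p2", "p3", "p4", "p5"]

    per_row_ordered = len(ordered_pairs(text, paraphrases))
    per_row_unique = len(unique_pairs(text, paraphrases))

    approx_rows = 350_000  # rounded row count from the description
    print(per_row_ordered, per_row_ordered * approx_rows)
    print(per_row_unique, per_row_unique * approx_rows)
```

Whether to train on directed or unique pairs depends on the model: a sequence-to-sequence paraphraser can use both directions, while a symmetric similarity model gains nothing from the reversed copies.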

**We used:**

- 231927 questions from the Quora dataset

- 92005 texts from the SQuAD 2.0 dataset

- 29110 texts from the CNN news dataset

**Structure of the dataset:**

- text - an original sentence or question from the source datasets

- paraphrases - a list of 5 paraphrases

- category - question / sentence

- source - quora / squad_2 / cnn_news
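A row with these four fields could be represented and validated as in the sketch below. The `ParaphraseRow` class and `parse_row` helper are hypothetical names. The handling of `paraphrases` stored as a string is an assumption about how a CSV export might serialize the list, so check the actual files before relying on it.

```python
import ast
from dataclasses import dataclass
from typing import List

# Allowed values as listed in the dataset description.
ALLOWED_CATEGORIES = {"question", "sentence"}
ALLOWED_SOURCES = {"quora", "squad_2", "cnn_news"}


@dataclass
class ParaphraseRow:
    text: str               # original sentence or question
    paraphrases: List[str]  # 5 paraphrases of `text`
    category: str           # "question" or "sentence"
    source: str             # "quora", "squad_2" or "cnn_news"


def parse_row(raw: dict) -> ParaphraseRow:
    """Build a validated row from a dict with the four documented columns."""
    paraphrases = raw["paraphrases"]
    # Assumption: a CSV export may store the list as its string representation.
    if isinstance(paraphrases, str):
        paraphrases = ast.literal_eval(paraphrases)
    paraphrases = list(paraphrases)

    if len(paraphrases) != 5:
        raise ValueError(f"expected 5 paraphrases, got {len(paraphrases)}")
    if raw["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {raw['category']!r}")
    if raw["source"] not in ALLOWED_SOURCES:
        raise ValueError(f"unexpected source: {raw['source']!r}")

    return ParaphraseRow(raw["text"], paraphrases, raw["category"], raw["source"])


if __name__ == "__main__":
    # Made-up example row, not taken from the dataset.
    row = parse_row({
        "text": "How do I learn Python?",
        "paraphrases": "['a', 'b', 'c', 'd', 'e']",
        "category": "question",
        "source": "quora",
    })
    print(row)
```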
