This is a dataset of paraphrases created by ChatGPT.
We used this prompt to generate paraphrases:
Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}
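A minimal sketch of how the prompt above could be filled in and how a numbered-list reply could be split back into separate paraphrases. The helper names `build_prompt` and `parse_numbered_list` are hypothetical, and the parsing rule (lines starting with a number followed by `.` or `)`) is an assumption about the reply format, not the authors' actual pipeline. No API call is shown.

```python
import re

# Prompt text as given above; {text} is replaced with the source sentence or question.
PROMPT_TEMPLATE = (
    "Generate 5 similar paraphrases for this question, "
    "show it like a numbered list without commentaries: {text}"
)


def build_prompt(text: str) -> str:
    """Insert one source text into the prompt template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_numbered_list(reply: str) -> list[str]:
    """Collect the items of a numbered list, skipping lines that are not list items.

    Assumes each paraphrase sits on its own line, e.g. "1. ..." or "1) ...".
    """
    items = []
    for line in reply.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            items.append(match.group(1))
    return items
```

In practice it may be worth checking that exactly 5 items were parsed before storing a row, since a model reply is not guaranteed to follow the requested format.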
This dataset is based on questions from the Quora paraphrase dataset and on texts from the SQuAD 2.0 and CNN news datasets.
We generated 5 paraphrases for each sample; in total the dataset has about 350k rows. Each row contains 6 texts (the original plus 5 paraphrases), so a single row yields 6 x 5 = 30 ordered pairs. Across the whole dataset this gives about 10.5 million training pairs (6 x 5 x 350,000 = 10.5 million directed pairs, or 6 x 5 x 350,000 / 2 = 5.25 million unique pairs).
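The pair counts above can be reproduced with a short sketch. The function name `expand_row` is hypothetical; it takes the original text and its list of paraphrases (the `text` and `paraphrases` fields of a row) and returns both the directed and the unique pairs.

```python
from itertools import combinations, permutations


def expand_row(text, paraphrases):
    """Expand one row into training pairs (hypothetical helper).

    The original text and its paraphrases are treated as one group of
    mutually equivalent texts.
    """
    variants = [text, *paraphrases]              # 6 texts when there are 5 paraphrases
    directed = list(permutations(variants, 2))   # ordered pairs: (a, b) and (b, a) both kept
    unique = list(combinations(variants, 2))     # unordered pairs: each pair kept once
    return directed, unique


# Placeholder strings stand in for real texts, purely to check the counts.
directed, unique = expand_row("q0", ["p1", "p2", "p3", "p4", "p5"])
print(len(directed), len(unique))  # 30 15
```

If a row contains duplicate paraphrases, some pairs will consist of two identical strings, so deduplicating the variants first may be preferable.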
We used:
231927 questions from the Quora dataset
92005 texts from the SQuAD 2.0 dataset
29110 texts from the CNN news dataset
Structure of the dataset:
text - an original sentence or question from the source datasets
paraphrases - a list of 5 paraphrases
category - question / sentence
source - quora / squad_2 / cnn_news