CANNOT Dataset | Papers With Code

## Dataset Summary

**CANNOT** is a dataset that focuses on negated textual pairs. It currently
contains **77,376 samples**, of which roughly half are negated pairs of
sentences and the other half are not (they are paraphrased versions of each
other).

The most frequent negation that appears in the dataset is verbal negation (e.g.,
will → won't), although it also contains pairs with antonyms (cold → hot).

<br>

## Languages
CANNOT contains only **English** text.

<br>

## Dataset Structure

The dataset is given as a
[`.tsv`](https://en.wikipedia.org/wiki/Tab-separated_values) file with the
following structure:

| premise     | hypothesis                                         | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). | 0     |
| A sentence. | The sentence negated.                              | 1     |

The dataset can be easily loaded into a Pandas DataFrame by running:

```Python
import pandas as pd

dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')

```
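Once loaded, a quick sanity check is to count the pairs per label. The sketch below assumes the column names shown in the table above. The rows are made-up stand-ins read from an in-memory string so that the snippet runs without the real file, and `label_summary` is a hypothetical helper name.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real .tsv file (rows invented for illustration).
sample_tsv = (
    "premise\thypothesis\tlabel\n"
    "It will rain.\tIt won't rain.\t1\n"
    "It will rain.\tRain is expected.\t0\n"
    "The soup is cold.\tThe soup is hot.\t1\n"
)


def label_summary(df):
    """Return the number of pairs per label (0 = paraphrase, 1 = negated)."""
    return df["label"].value_counts().sort_index()


sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
summary = label_summary(sample)

# Keep only the negated pairs.
negated = sample[sample["label"] == 1]
print(summary)
```

With the real file, passing `dataset` to `label_summary` should show whether the two labels are roughly balanced.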

<br>

## Dataset Creation

The dataset has been created by cleaning up and merging the following datasets:

1. _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal
    Negation_ (see
[`datasets/nan-nli`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/nan-nli)).

2. _GLUE Diagnostic Dataset_ (see
[`datasets/glue-diagnostic`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/glue-diagnostic)).

3. _Automated Fact-Checking of Claims from Wikipedia_ (see
[`datasets/wikifactcheck-english`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/wikifactcheck-english)).

4. _From Group to Individual Labels Using Deep Features_ (see
[`datasets/sentiment-labelled-sentences`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/sentiment-labelled-sentences)).
In this case, the negated sentences were obtained by using the Python module
[`negate`](https://github.com/dmlls/negate).

5. _It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With
Antonyms and Negation Using the New SemAntoNeg Benchmark_ (see
[`datasets/antonym-substitution`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/antonym-substitution)).

Once processed, the number of remaining samples in each of the datasets above is:

| Dataset                                                                   | Samples    |
|:--------------------------------------------------------------------------|-----------:|
| Not another Negation Benchmark                                            |      118   |
| GLUE Diagnostic Dataset                                                   |      154   |
| Automated Fact-Checking of Claims from Wikipedia                          |   14,970   |
| From Group to Individual Labels Using Deep Features                       |    2,110   |
| It Is Not Easy To Detect Paraphrases                                      |    8,597   |
| <div align="right"><b>Total</b></div>                                     | **25,949** |

Additionally, for each of the negated samples, another pair of non-negated
sentences has been added by paraphrasing them with the pre-trained model
[`🤗tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase).

Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been
included, and any duplicates have been removed.
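
The swap-and-deduplicate step can be sketched with pandas as follows. This is an illustration under assumptions, not necessarily the authors' actual pipeline: `add_swapped_pairs` is a hypothetical helper, the example rows are invented, and "duplicates" is taken to mean exact row matches.

```python
import pandas as pd


def add_swapped_pairs(df):
    """Append the premise/hypothesis-swapped copy of each pair, then drop exact duplicates."""
    swapped = df.rename(columns={"premise": "hypothesis", "hypothesis": "premise"})
    # Restore the original column order so both frames line up.
    swapped = swapped[df.columns]
    combined = pd.concat([df, swapped], ignore_index=True)
    return combined.drop_duplicates(ignore_index=True)


# Invented example: the first two rows are already swapped versions of each other.
pairs = pd.DataFrame({
    "premise": ["It will rain.", "It won't rain.", "The soup is cold."],
    "hypothesis": ["It won't rain.", "It will rain.", "The soup is hot."],
    "label": [1, 1, 1],
})
augmented = add_swapped_pairs(pairs)
print(len(pairs), "->", len(augmented))
```

Because some pairs may already be present in both directions, the augmented set can end up smaller than twice the original.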

With this, the number of premises/hypotheses in the CANNOT dataset that appear
in the original datasets is:

| <div align="left"><b>Dataset</b></div>                                                                   | <div align="center"><b>Sentences</b></div>             |
|:--------------------------------------------------------------------------|----------------------:|
| Not another Negation Benchmark                                            |         552 &nbsp;&nbsp;&nbsp; (0.36 %) |
| GLUE Diagnostic Dataset                                                   |         586 &nbsp;&nbsp;&nbsp; (0.38 %) |
| Automated Fact-Checking of Claims from Wikipedia                          |      89,728 &nbsp; (57.98 %) |
| From Group to Individual Labels Using Deep Features                       |      12,626 &nbsp;&nbsp;&nbsp; (8.16 %) |
| It Is Not Easy To Detect Paraphrases                                      |      17,198 &nbsp; (11.11 %) |
| <div align="right"><b>Total</b></div>                                     | **120,690** &nbsp; (77.99 %) |

The percentages above are in relation to the total number of premises and
hypotheses in the CANNOT dataset. The remaining 22.01 % (34,062 sentences) are
the novel premises/hypotheses added through paraphrase and rule-based negation.
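
The shares can be recomputed from the counts. The sketch below assumes that each of the 77,376 samples contributes exactly two sentences (one premise and one hypothesis), which is consistent with 120,690 + 34,062 sentences in total. The helper `share` is a hypothetical name introduced here.

```python
# Counts taken from the tables and text above.
total_pairs = 77_376
total_sentences = 2 * total_pairs  # assumption: one premise + one hypothesis per sample

sentences_from_sources = {
    "Not another Negation Benchmark": 552,
    "GLUE Diagnostic Dataset": 586,
    "Automated Fact-Checking of Claims from Wikipedia": 89_728,
    "From Group to Individual Labels Using Deep Features": 12_626,
    "It Is Not Easy To Detect Paraphrases": 17_198,
}


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share(n, total_sentences) for name, n in sentences_from_sources.items()}
original_total = sum(sentences_from_sources.values())
novel = total_sentences - original_total

for name, pct in shares.items():
    print(f"{name}: {pct} %")
print(f"Original: {original_total} ({share(original_total, total_sentences)} %)")
print(f"Novel: {novel} ({share(novel, total_sentences)} %)")
```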

<br>

## Additional Information

<br>

### Licensing Information

The CANNOT dataset is released under [CC BY-SA
4.0](https://creativecommons.org/licenses/by-sa/4.0/).

<br>

### Citation

Please cite our [INLG 2023 paper](https://arxiv.org/abs/2307.13989) if you use our dataset.

**BibTeX:**
```bibtex
@misc{anschütz2023correct,
      title={This is not correct! Negation-aware Evaluation of Language Generation Systems}, 
      author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
      year={2023},
      eprint={2307.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

<br>

### Contributions

Contributions to the dataset can be submitted through the [project
repository](https://github.com/dmlls/cannot-dataset).

CANNOT (Compilation of ANnotated, Negation-Oriented Text-pairs)
