We used the following procedure. First, we automatically identified the set of verbs and nouns from which to build our items. We started with the bert-base-uncased vocabulary, ran all non-subword lexical tokens through a spaCy POS tagger, lemmatized the results using Pattern (https://pypi.org/project/Pattern/), and dropped duplicates. We then filtered out modal verbs, singularia tantum nouns, and visible lemmatization mistakes. Finally, we removed intransitive verbs to raise the baseline grammaticality of the dataset.
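The filtering step above can be sketched as follows. This is a schematic reconstruction, not our exact pipeline: a small stub dictionary (`POS_LEMMA`) stands in for the spaCy tagger and the Pattern lemmatizer, and the names `MODALS`, `TRANSITIVE`, and `filter_vocabulary` are ours.

```python
# Schematic sketch of the vocabulary-filtering step. A real run would use
# spaCy for POS tagging and Pattern for lemmatization; here a stub
# dictionary stands in for both so the logic is self-contained.
POS_LEMMA = {                     # token -> (POS tag, lemma); stub lookup
    "roads": ("NOUN", "road"),
    "road": ("NOUN", "road"),     # duplicate lemma: deduplicated below
    "built": ("VERB", "build"),
    "can": ("VERB", "can"),       # modal: to be filtered out
    "sleep": ("VERB", "sleep"),   # intransitive: to be filtered out
}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
TRANSITIVE = {"build"}            # stub for corpus-based transitivity statistics

def filter_vocabulary(vocab):
    nouns, verbs = set(), set()
    for token in vocab:
        if token.startswith("##"):        # drop BERT subword pieces
            continue
        if token not in POS_LEMMA:        # non-lexical or unknown token
            continue
        pos, lemma = POS_LEMMA[token]
        if pos == "NOUN":
            nouns.add(lemma)              # the set deduplicates lemmas
        elif pos == "VERB" and lemma not in MODALS and lemma in TRANSITIVE:
            verbs.add(lemma)
    return sorted(nouns), sorted(verbs)

nouns, verbs = filter_vocabulary(["roads", "road", "##ing", "built", "can", "sleep"])
print(nouns, verbs)   # ['road'] ['build']
```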
We kept the top 100 nouns and the top 100 verbs from the resulting lists -- these are the lexical entries we work with. We then generated sentences from these words by iterating over the 100 verbs and over the 100 nouns in the subject and the object positions (excluding cases where the same noun appears in both positions). This procedure gave us 990k sentences like these:
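The generation loop amounts to a product over the verbs and the ordered pairs of distinct nouns; a minimal sketch (the transitive template string is an illustrative assumption, not necessarily our exact frame):

```python
from itertools import permutations, product

def generate(nouns, verbs, template="the {subj} {verb} the {obj}"):
    # Ordered (subject, object) pairs of distinct nouns, crossed with all
    # verbs; permutations() already excludes pairs with a repeated noun.
    for (subj, obj), verb in product(permutations(nouns, 2), verbs):
        yield template.format(subj=subj, verb=verb, obj=obj)

# With 100 nouns and 100 verbs this yields 100 * 99 * 100 = 990,000 sentences.
sentences = list(generate(["dog", "road", "tree"], ["saw", "crossed"]))
print(len(sentences))   # 3 * 2 * 2 = 12
```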
These sentences differ in how natural they sound, how much sense they make, and how well they respect the verbs' selectional restrictions. To control for this, we ran all candidates through GPT-2 and assigned each a perplexity score. We then kept the 20k sentences with the lowest perplexity (the most 'natural' ones) as the core of our synthetic dataset.
We approximated the 'naturalness' of examples by a combination of measures, relying on insights from different models (GPT-2, BERT, corpus-based statistics on verb transitivity) at different stages of dataset creation. Still, some sentences sound intuitively 'weird'. We do not see this as a problem: we will not rely directly on the naturalness of individual examples; rather, we will measure the effect of the NPI across the dataset. The number of examples allows us to generalize across the varying parts of the sentences and to attribute the results to the parts we are interested in: the items responsible for the monotonicity of the sentence. The quantity of test items is crucial when reproducing psycholinguistic experiments on LRMs: in a psycholinguistic study, one sentence gives rise to a number of observations as different human subjects judge it, whereas one test sentence presented to a model yields only one observation.% Here the procedures of psycholinguistic studies and LRM studies necessarily diverge.
With this in mind, we use the 20k sentences produced by the previous steps to build the parts of our synthetic dataset. In each sentence, the object is pluralized (it is no longer singular) and combined with any: any roads. The subject type varies across the datasets that make up our synthetic data.
Overall, the sentences in all parts of our dataset vary in the type of context they instantiate (simple affirmative, negation, quantifiers of different monotonicity), but all contain any in the object position in combination with a plural noun. We manipulate the presence or absence of any to measure how any interacts with different types of environments.
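The manipulation just described yields minimal pairs that differ only in the presence of any. A sketch, with hypothetical context frames and a naive `+ 's'` pluralization (both are our simplifications for illustration):

```python
# Hypothetical context frames of different monotonicity; the frames in the
# actual dataset may differ. Pluralization here is a naive "+ 's'".
CONTEXTS = ["the {subj} crossed", "no {subj} crossed", "every {subj} crossed"]

def minimal_pair(context, subj, obj):
    frame = context.format(subj=subj)
    with_any = f"{frame} any {obj}s"      # NPI condition
    without_any = f"{frame} {obj}s"       # bare-plural control
    return with_any, without_any

for ctx in CONTEXTS:
    print(minimal_pair(ctx, "traveler", "road"))
```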