A dataset of pairs, each consisting of a multimodal tweet and a fact-checking (FC) article from snopes.com.
20 PAPERS • 1 BENCHMARK
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Figures and captions are extracted from open access articles in PubMed Central, and the corresponding reference text is derived from S2ORC. The dataset consists of: 217,060 figures from 131,410 open access papers; 7,507 subcaption and subfigure annotations for 2,069 compound figures; and inline references for ~25K figures in the ROCO dataset.
16 PAPERS • NO BENCHMARKS YET
A dataset of pairs, each consisting of a multimodal tweet and a fact-checking (FC) article from politifact.com.
16 PAPERS • 1 BENCHMARK
DuLeMon is a large-scale Chinese Long-term Memory Conversation dataset, which simulates long-term memory conversations and focuses on the ability to actively construct and utilize the user's and the bot's persona in a long-term interaction. DuLeMon contains about 27.5k human-human conversations, 449k utterances, and 12k persona grounding sentences. This corpus can be used to explore Long-term Memory Conversation, Personalized Dialogue, and Persona Extraction / Matching / Retrieval.
11 PAPERS • NO BENCHMARKS YET
A large-scale curated dataset of over 152 million tweets related to COVID-19 chatter, growing daily, generated from January 1st to April 4th at the time of writing.
10 PAPERS • 6 BENCHMARKS
The Composed Quora dataset consists of questions extracted from Quora, grouped together when they ask the same thing. The dataset contains 60,400 groups of questions, each with at least 3 questions that ask the same thing.
1 PAPER • NO BENCHMARKS YET
PSM is a financial-domain dataset for the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in a search scenario.