The Amazon-Google dataset for entity resolution derives from the online retailers Amazon.com and the product search service of Google accessible through the Google Base Data API. The dataset contains 1363 entities from amazon.com and 3226 google products as well as a gold standard (perfect mapping) with 1300 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description, manufacturer and price.
20 PAPERS • 2 BENCHMARKS
The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from abt.com and 1092 entities from buy.com as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price.
19 PAPERS • 2 BENCHMARKS
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
8 PAPERS • 4 BENCHMARKS
DBLP Temporal is a dataset for temporal entity resolution, based on author profiles extracted from the Digital Bibliography and Library Project (DBLP).
7 PAPERS • 1 BENCHMARK
WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are
5 PAPERS • 4 BENCHMARKS
The MusicBrainz20K dataset for entity resolution and entity clustering is based on real records about songs from the MusicBrainz database. Each record is described with the following attributes: artist, title, album, year and length. The records have been modified with the DAPO [1] data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. It contains duplicates for 50% of the original records in two to five sources which are generated with a high degree of corruption to stress-test the entity resolution and clustering approaches.
2 PAPERS • 1 BENCHMARK
Hand-disambiguation of a sample of U.S. patents inventor mentions from PatentsView.org.
1 PAPER • NO BENCHMARKS YET
CEREC is a large scale corpus for entity resolution in email conversations. The corpus consists of 6001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort.
The dataset contains entities from IMDB, TheMovieDB and TheTVDB with goldstandard matches between the sources. Due to the licensing of IMDB we provide a script to build the IMDB part of the dataset yourself.
PIZZA is a dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.
This dataset is used for user identity linkage across two online social networks in Chinese. It contains two popular Chinese social platforms: Sina Weibo\footnote{https://weibo.com} and Douban\footnote{https://www.douban.com}.