The Amazon-Google dataset for entity resolution derives from the online retailer Amazon.com and Google's product search service, accessible through the Google Base Data API. The dataset contains 1363 entities from Amazon.com and 3226 Google products, as well as a gold standard (perfect mapping) with 1300 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description, manufacturer and price.
8 PAPERS • 1 BENCHMARK
The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from Abt.com and 1092 entities from Buy.com, as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description and product price.
7 PAPERS • 1 BENCHMARK
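Datasets that ship a gold standard of matching record pairs, such as the two above, are typically scored by comparing a matcher's predicted pairs against that mapping. The following is a minimal sketch of pair-level precision, recall and F1. The function name `pair_metrics`, the record identifiers and the toy pairs are illustrative assumptions, not part of either dataset or of any official evaluation script.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (id in source A, id in source B); hypothetical ids


def pair_metrics(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Pair-level precision, recall and F1 against a gold-standard mapping."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (made up for illustration): three gold matches, two predictions,
# one of which is correct.
gold = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
predicted = [("a1", "b1"), ("a2", "b9")]
metrics = pair_metrics(predicted, gold)
print(metrics)
```

Treating pairs as ordered tuples assumes every pair is written as (first source, second source); if the ids of the two sources can be mixed, normalizing each pair before building the sets would be needed.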
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
7 PAPERS • 4 BENCHMARKS
DBLP Temporal is a dataset for temporal entity resolution, based on author profiles extracted from the Digital Bibliography and Library Project (DBLP).
5 PAPERS • 1 BENCHMARK
The MusicBrainz20K dataset for entity resolution and entity clustering is based on real records about songs from the MusicBrainz database. Each record is described with the following attributes: artist, title, album, year and length. The records have been modified with the DAPO data generator. The generated dataset consists of five sources and approximately 20K records describing 10K unique song entities. It contains duplicates for 50% of the original records in two to five sources; the duplicates are generated with a high degree of corruption to stress-test entity resolution and clustering approaches.
2 PAPERS • 1 BENCHMARK
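For a multi-source clustering dataset like the one above, a ground truth given as one cluster label per record can be expanded into the set of matching record pairs, so that pair-based metrics apply. The sketch below assumes a plain mapping from record id to cluster id; the function name `clusters_to_pairs` and the toy ids are hypothetical and do not reflect the dataset's actual file layout.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def clusters_to_pairs(record_to_cluster: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Expand a record -> cluster assignment into all within-cluster pairs."""
    members: Dict[str, List[str]] = defaultdict(list)
    for record_id, cluster_id in record_to_cluster.items():
        members[cluster_id].append(record_id)

    pairs: Set[Tuple[str, str]] = set()
    for record_ids in members.values():
        # Sorting gives each pair a canonical (smaller id, larger id) order.
        for left, right in combinations(sorted(record_ids), 2):
            pairs.add((left, right))
    return pairs


# Toy assignment (made up): three records describe one song, one is a singleton.
toy_assignment = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
pairs = clusters_to_pairs(toy_assignment)
print(sorted(pairs))
```

A cluster of size n contributes n(n-1)/2 pairs, so singleton clusters contribute none.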
CEREC is a large-scale corpus for entity resolution in email conversations. The corpus consists of 6001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort.
1 PAPER • NO BENCHMARKS YET