WDC Block (WDC Block: A Blocking Benchmark)

Introduced by Brinkmann et al. in SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines

WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines.

Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, reducing the workload of the matcher. During matching, a more expensive pair-wise matcher produces the final set of matching record pairs.
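The two-step approach can be illustrated with a minimal Python sketch. It is not the method of the benchmark authors: the token-overlap blocker, the Jaccard matcher, the threshold of 0.5, and the toy records are all assumptions chosen only to show how a cheap blocker shrinks the set of pairs that the more expensive matcher has to score.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a record and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def token_blocking(records_a, records_b):
    """Cheap blocker (illustrative): keep pairs that share at least one token."""
    # Inverted index from token to the ids of records in B containing it.
    index = defaultdict(set)
    for id_b, text in records_b.items():
        for token in tokenize(text):
            index[token].add(id_b)

    candidates = set()
    for id_a, text in records_a.items():
        for token in tokenize(text):
            for id_b in index.get(token, ()):
                candidates.add((id_a, id_b))
    return candidates


def jaccard_matcher(records_a, records_b, candidates, threshold=0.5):
    """Pair-wise matcher (illustrative): accept pairs with high token Jaccard."""
    matches = set()
    for id_a, id_b in candidates:
        tokens_a = tokenize(records_a[id_a])
        tokens_b = tokenize(records_b[id_b])
        similarity = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
        if similarity >= threshold:
            matches.add((id_a, id_b))
    return matches


# Hypothetical toy records, not taken from WDC Block.
dataset_a = {
    "a1": "apple iphone 12 64gb black",
    "a2": "samsung galaxy s21 128gb",
}
dataset_b = {
    "b1": "iphone 12 apple 64gb",
    "b2": "galaxy s21 samsung 128gb phantom gray",
    "b3": "logitech mx master 3 mouse",
}

candidates = token_blocking(dataset_a, dataset_b)
matches = jaccard_matcher(dataset_a, dataset_b, candidates)
all_pairs = len(dataset_a) * len(dataset_b)
print(f"{all_pairs} pairs in A x B, {len(candidates)} candidates, {len(matches)} matches")
```

In this toy setting the matcher only scores the candidate pairs produced by the blocker instead of every pair in the Cartesian product.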

Existing benchmark datasets for blocking and matching are rather small, both in terms of the Cartesian product A × B of all record pairs and in terms of vocabulary size. If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary (a large number of unique tokens that need to be indexed) may be missed. The Web Data Commons Block (WDC Block) benchmark is a new blocking benchmark that provides much larger datasets and thus requires blockers that address these scalability challenges. WDC Block features a Cartesian product of up to 200 billion pairs of product offers, which were extracted from 3,259 e-shops. Additionally, we provide three development sets of different sizes (~1K pairs, ~5K pairs and ~20K pairs) to experiment with different amounts of training data for the blockers.
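To make the role of the Cartesian product concrete, the following sketch computes two measures that are commonly reported for blockers: recall over the known matches and the reduction ratio relative to A × B. Whether WDC Block reports exactly these measures is an assumption here, and the function name `blocker_stats` and all numbers are hypothetical.

```python
def blocker_stats(size_a, size_b, candidates, true_matches):
    """Summarize a blocker's output against the full Cartesian product.

    size_a, size_b: number of records in datasets A and B.
    candidates:     set of (id_a, id_b) pairs kept by the blocker.
    true_matches:   set of (id_a, id_b) pairs known to match.
    """
    cartesian = size_a * size_b
    found = len(candidates & true_matches)
    # Recall: share of true matches that survive blocking.
    recall = found / len(true_matches) if true_matches else 1.0
    # Reduction ratio: share of the Cartesian product that was pruned.
    reduction_ratio = 1 - len(candidates) / cartesian
    return {
        "cartesian": cartesian,
        "candidates": len(candidates),
        "recall": recall,
        "reduction_ratio": reduction_ratio,
    }


# Hypothetical example: 4 x 5 records, 4 candidate pairs, 2 true matches.
example_candidates = {(0, 0), (1, 1), (2, 4), (3, 2)}
example_matches = {(0, 0), (1, 3)}
stats = blocker_stats(4, 5, example_candidates, example_matches)
print(stats)
```

A blocker that prunes aggressively raises the reduction ratio but risks losing true matches, which is why both numbers are usually read together.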
