BiodivTab: Semantic Table Annotation Benchmark Construction, Analysis, and New Additions

Systems that semantically annotate tabular data have attracted increasing attention from the community in recent years; this process is commonly known as Semantic Table Annotation (STA). Its objective is to map individual table elements to their counterparts in a Knowledge Graph (KG): individual cells and columns are assigned to KG entities and classes to disambiguate their meaning. STA systems achieve high scores on the existing, synthetic benchmarks but often struggle on real-world datasets, so realistic evaluation benchmarks are needed to advance the field. In this paper, we detail the construction pipeline of BiodivTab, the first benchmark based on real-world data from the biodiversity domain, compare it with existing benchmarks, and highlight common data characteristics and challenges in the field. BiodivTab is publicly available and comprises 50 tables, a mixture of real and augmented samples from biodiversity datasets. It was used in the SemTab 2021 challenge, where participants achieved F1-scores of at most ∼60% on the individual annotation tasks. Such results show that domain-specific benchmarks are more challenging for state-of-the-art systems than synthetic datasets.
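To make the annotation task concrete, below is a minimal sketch of cell-entity annotation (CEA) against DBpedia, the KG used in the paper. It is not the benchmark's official tooling: it assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and a simple exact-label match, whereas real STA systems rely on much richer candidate generation and disambiguation.

```python
# Minimal CEA sketch: map a table cell's text to candidate DBpedia entities.
# Assumption: the public SPARQL endpoint at https://dbpedia.org/sparql is reachable;
# exact English-label matching stands in for a real candidate-generation step.
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def candidate_entities(cell_value: str, limit: int = 5) -> list[str]:
    """Return DBpedia resource URIs whose English rdfs:label equals the cell text."""
    query = f"""
    SELECT DISTINCT ?entity WHERE {{
      ?entity rdfs:label "{cell_value}"@en .
      FILTER(STRSTARTS(STR(?entity), "http://dbpedia.org/resource/"))
    }} LIMIT {limit}
    """
    resp = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["entity"]["value"] for b in bindings]

if __name__ == "__main__":
    # Hypothetical cell from a species column of a biodiversity table.
    print(candidate_entities("Fagus sylvatica"))
```

Column-type annotation (CTA) can be sketched on top of this by collecting the rdf:type values of the matched entities for all cells in a column and voting for the most frequent class.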


Datasets


Introduced in the Paper:

BiodivTab

Used in the Paper:

DBpedia, GitTables

