BiodivTab: Semantic Table Annotation Benchmark Construction, Analysis, and New Additions

Systems that semantically annotate tabular data have attracted increasing attention from the community in recent years; this process is commonly known as Semantic Table Annotation (STA). Its objective is to map individual table elements to their counterparts in a Knowledge Graph (KG): individual cells and columns are assigned to KG entities and classes to disambiguate their meaning. STA systems achieve high scores on the existing, synthetic benchmarks but often struggle on real-world datasets, so realistic evaluation benchmarks are needed to advance the field. In this paper, we detail the construction pipeline of BiodivTab, the first benchmark based on real-world data from the biodiversity domain, compare it with existing benchmarks, and highlight common data characteristics and challenges in the field. BiodivTab is publicly available and comprises 50 tables, a mixture of real and augmented samples from biodiversity datasets. It was used in the SemTab 2021 challenge, where participants achieved F1-scores of at most ∼60% on the individual annotation tasks. Such results show that domain-specific benchmarks are more challenging for state-of-the-art systems than synthetic datasets.
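To make the annotation task concrete, below is a minimal sketch of cell-entity annotation (CEA) against DBpedia, the KG used in the paper. It is not the benchmark's official tooling: it assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and a simple exact-label match, whereas real STA systems rely on much richer candidate generation and disambiguation.

```python
# Minimal CEA sketch: map a table cell's text to candidate DBpedia entities.
# Assumption: the public SPARQL endpoint at https://dbpedia.org/sparql is reachable;
# exact English-label matching stands in for a real candidate-generation step.
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def candidate_entities(cell_value: str, limit: int = 5) -> list[str]:
    """Return DBpedia resource URIs whose English rdfs:label equals the cell text."""
    query = f"""
    SELECT DISTINCT ?entity WHERE {{
      ?entity rdfs:label "{cell_value}"@en .
      FILTER(STRSTARTS(STR(?entity), "http://dbpedia.org/resource/"))
    }} LIMIT {limit}
    """
    resp = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return [b["entity"]["value"] for b in bindings]

if __name__ == "__main__":
    # Hypothetical cell from a species column of a biodiversity table.
    print(candidate_entities("Fagus sylvatica"))
```

Column-type annotation (CTA) can be sketched on top of this by collecting the rdf:type values of the matched entities for all cells in a column and voting for the most frequent class.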


Datasets


Introduced in the Paper:

BiodivTab

Used in the Paper:

DBpedia, GitTables

