RFUND is a relabeled version of the FUNSD and XFUND datasets, addressing the following issues in their original annotations:

  1. Entity (block)-level OCR results. Real-world OCR engines usually produce line-level results, whereas the annotations in FUNSD and XFUND are at the entity (block) level: text lines within the same entity are aggregated and serialized in human reading order. This simplifies the task scope and fails to reflect real-world challenges.
  2. Inconsistent labelling granularity. In FUNSD, most contents are annotated at the entity level, but multi-line entities with first-line indentation are annotated separately: the first line is split out and the remaining lines are aggregated. XFUND exhibits variable granularity, with some contents labelled at the entity level and others at the line level. Such inconsistent labelling standards can hinder model training.
  3. Erroneous category annotations. Entities in FUNSD/XFUND are categorized as "header", "question", "answer", and "other". We observed that certain entities in both FUNSD and XFUND have category labels that differ from human understanding. In the following example, the answer entity "Client confirmed agreement ..." was labelled as "other", while the other entity "CONFIDENTIAL" was labelled as "question".
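The granularity gap in point 1 can be illustrated with a small sketch. The dictionaries below use assumed field names (not the actual RFUND/FUNSD schema), contrasting an entity-level annotation, where multi-line text is pre-merged in reading order, with the line-level fragments a real OCR engine would emit:

```python
# Hypothetical annotation sketch; field names and text are illustrative,
# not the actual RFUND/FUNSD JSON schema.

# Entity-level annotation: lines already merged in human reading order.
entity_level = {
    "label": "answer",
    "text": "first line of the entity continued on the second line",
    "box": [40, 100, 320, 150],  # bounding box of the whole entity
}

# Line-level annotations: what a real-world OCR engine typically returns.
line_level = [
    {"label": "answer", "text": "first line of the entity", "box": [40, 100, 300, 120]},
    {"label": "answer", "text": "continued on the second line", "box": [40, 125, 320, 150]},
]

def merge_lines(lines):
    """Re-aggregate line-level results into one entity-level string,
    assuming the lines are already sorted in reading order."""
    return " ".join(line["text"] for line in lines)

print(merge_lines(line_level) == entity_level["text"])  # True
```

Recovering the entity-level view from line-level input requires both grouping lines into entities and ordering them correctly, which is exactly the step the original block-level annotations hide from the model.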
