Semantic Structure Extraction for Spreadsheet Tables with a Multi-task Learning Architecture

Semantic structure extraction for spreadsheets includes detecting table regions, recognizing structural components and classifying cell types. Automatic semantic structure extraction is key to automatic data transformation from various table structures into canonical schema so as to enable data analysis and knowledge discovery. However, they are challenged by the diverse table structures and the spatial-correlated semantics on cell grids. To learn spatial correlations and capture semantics on spreadsheets, we have developed a novel learning-based framework for spreadsheet semantic structure extraction. First, we propose a multi-task framework that learns table region, structural components and cell types jointly; second, we leverage the advances of the recent language model to capture semantics in each cell value; third, we build a large human-labeled dataset with broad coverage of table structures. Our evaluation shows that our proposed multi-task framework is highly effective that outperforms the results of training each task separately.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here