Identifying the Units of Measurement in Tabular Data

23 Nov 2021  ·  Taha Ceritli, Christopher K. I. Williams ·

We consider the problem of identifying the units of measurement in a data column that contains both numeric values and unit symbols in each row, e.g., "5.2 l", "7 pints". In this case we seek to identify the dimension of the column (e.g. volume) and relate the unit symbols to valid units (e.g. litre, pint) obtained from a knowledge graph. Below we present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns and canonicalize their entries. We present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate the research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here