ptype-cat: Inferring the Type and Values of Categorical Variables

23 Nov 2021  ·  Taha Ceritli, Christopher K. I. Williams ·

Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated either as integer or string rather than categorical, and need to be transformed into categorical manually by the user. In this paper, we propose a probabilistic type inference method that can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting the existing type inference method ptype. Combining these methods, we present ptype-cat which achieves better results than existing applicable solutions.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here