Language Models are Realistic Tabular Data Generators

12 Oct 2022 · Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci

Tabular data is among the oldest and most ubiquitous forms of data. However, generating synthetic samples that preserve the original data's characteristics remains a significant challenge. While many generative models from the computer vision domain, such as variational autoencoders and generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets of various sizes with heterogeneous feature types.
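To make the idea of conditioning on a feature subset concrete, here is a minimal sketch of how a table row might be turned into text for an auto-regressive LLM and parsed back. The abstract does not spell out the encoding, so the "feature is value" sentence format, the random feature order, and the helper names `encode_row`, `conditioning_prompt`, and `decode_row` are all assumptions made for illustration. The LLM fine-tuning and sampling step is deliberately left out.

```python
import random


def encode_row(row, rng):
    """Serialize one table row as text, with features in random order.

    Assumption: shuffling the order during training is what would let a
    left-to-right model later continue from any subset of features.
    """
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def conditioning_prompt(known):
    """Build a text prefix from the features whose values are fixed."""
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ","


def decode_row(text, columns):
    """Parse generated text back into a row.

    Returns None when the text does not cover exactly the expected
    columns, so malformed samples can be rejected.
    """
    row = {}
    for clause in text.split(","):
        name, sep, value = clause.strip().partition(" is ")
        if sep and name in columns:
            row[name] = value.strip()
    return row if set(row) == set(columns) else None


# Hypothetical example row; the values are illustrative only.
row = {"Age": 39, "Education": "Bachelors", "Income": ">50K"}
text = encode_row(row, random.Random(0))
print(text)
print(decode_row(text, list(row)))
print(conditioning_prompt({"Education": "Bachelors"}))
```

In this sketch, a model trained on such sentences would be given the output of `conditioning_prompt` as a prefix and asked to continue it; `decode_row` then recovers the sampled features as strings and discards incomplete generations. Values that themselves contain commas or the substring " is " would need escaping, which is omitted here.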


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Tabular Data Generation | Adult Census Income | GReaT | LR Accuracy | 84.77 | # 2 |
| | | | DT Accuracy | 84.81 | # 2 |
| | | | RF Accuracy | 85.42 | # 2 |
| | | | Parameters (M) | 355 | # 6 |
| Tabular Data Generation | Adult Census Income | Distill-GReaT | LR Accuracy | 84.65 | # 3 |
| | | | DT Accuracy | 84.49 | # 3 |
| | | | RF Accuracy | 85.25 | # 3 |
| | | | Parameters (M) | 82 | # 5 |
| Tabular Data Generation | California Housing Prices | Distill-GReaT | Parameters (M) | 82 | # 5 |
| | | | RF Mean Squared Error | 0.32 | # 2 |
| | | | LR Mean Squared Error | 0.57 | # 3 |
| | | | DT Mean Squared Error | 0.43 | # 2 |
| Tabular Data Generation | California Housing Prices | GReaT | Parameters (M) | 355 | # 6 |
| | | | RF Mean Squared Error | 0.28 | # 1 |
| | | | LR Mean Squared Error | 0.34 | # 1 |
| | | | DT Mean Squared Error | 0.39 | # 1 |
| Tabular Data Generation | Diabetes | Distill-GReaT | LR Accuracy | 0.5733 | # 3 |
| | | | DT Accuracy | 0.541 | # 3 |
| | | | RF Accuracy | 0.5803 | # 2 |
| | | | Parameters (M) | 82 | # 5 |
| Tabular Data Generation | Diabetes | GReaT | LR Accuracy | 0.5734 | # 2 |
| | | | DT Accuracy | 0.5523 | # 2 |
| | | | RF Accuracy | 0.5834 | # 1 |
| | | | Parameters (M) | 355 | # 6 |
| Tabular Data Generation | HELOC | Distill-GReaT | LR Accuracy | 70.58 | # 4 |
| | | | DT Accuracy | 81.4 | # 1 |
| | | | RF Accuracy | 82.14 | # 1 |
| | | | Parameters (M) | 82 | # 5 |
| Tabular Data Generation | HELOC | GReaT | LR Accuracy | 71.9 | # 1 |
| | | | DT Accuracy | 79.1 | # 2 |
| | | | RF Accuracy | 80.93 | # 2 |
| | | | Parameters (M) | 355 | # 6 |
| Tabular Data Generation | SICK | Distill-GReaT | LR Accuracy | 96.56 | # 2 |
| | | | DT Accuracy | 95.39 | # 3 |
| | | | RF Accuracy | 97.72 | # 2 |
| | | | Parameters (M) | 82 | # 5 |
| Tabular Data Generation | SICK | GReaT | LR Accuracy | 97.72 | # 1 |
| | | | DT Accuracy | 97.72 | # 1 |
| | | | RF Accuracy | 98.3 | # 1 |
| | | | Parameters (M) | 355 | # 6 |
| Tabular Data Generation | Travel | Distill-GReaT | LR Accuracy | 78.53 | # 4 |
| | | | DT Accuracy | 77.38 | # 4 |
| | | | RF Accuracy | 79.5 | # 4 |
| | | | Parameters (M) | 82 | # 5 |
| Tabular Data Generation | Travel | GReaT | LR Accuracy | 80.1 | # 2 |
| | | | DT Accuracy | 83.56 | # 2 |
| | | | RF Accuracy | 84.3 | # 2 |
| | | | Parameters (M) | 355 | # 6 |
