This repository contains the code and material to reproduce the results of the manuscript "What’s Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models". The paper is currently under review.
data/: Directory for storing the real and synthetic datasets. It contains one subfolder per dataset, each holding the real data (e.g., data/adult_complete/real/) and the synthetic data generated by the different generative models (e.g., data/adult_complete/syn/synthpop/). As synthesizing the datasets is computationally expensive, we already provide 10 synthetic versions of every real dataset.
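The folder layout described above can be inspected with a short script. The following is a minimal sketch, assuming that every synthesizer has its own subfolder under data/[DATASET_NAME]/syn/ that directly contains the synthetic dataset files. The function name list_synthetic_files is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_synthetic_files(data_root, dataset):
    """Map each synthesizer subfolder to the sorted file names it contains.

    Assumption: synthetic files live in <data_root>/<dataset>/syn/<model>/.
    """
    syn_dir = Path(data_root) / dataset / "syn"
    if not syn_dir.is_dir():
        return {}
    return {
        model_dir.name: sorted(f.name for f in model_dir.iterdir() if f.is_file())
        for model_dir in sorted(syn_dir.iterdir())
        if model_dir.is_dir()
    }


if __name__ == "__main__":
    # Print the number of files found per synthesizer for one dataset.
    for model, files in list_synthetic_files("data", "adult_complete").items():
        print(model, len(files))
```

If the layout matches the assumption, each synthesizer should report the same number of files.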
Used datasets:
('N' = Number of instances, 'p' = Number of features, 'Cat' = number of categorical features, 'Cont' = number of numerical features)
- adult_complete: UCI ID: 2 (N = 47876, p = 14, Cat = 8, Cont = 6)
  Note: In comparison with the original dataset, we renamed some variables and classes, dropped the variable education because of redundancy with education_num, and reduced the number of classes in the variable native-country to the top 30 most frequent countries.
- nursery: UCI ID: 76 (N = 12958, p = 9, Cat = 9, Cont = 0)
  Note: We removed all instances with the class recommend in the variable class.
- car_evaluation: UCI ID: 19 (N = 1728, p = 7, Cat = 7, Cont = 0)
- chess_king_rook_vs_king: UCI ID: 23 (N = 28056, p = 7, Cat = 4, Cont = 3)
- connect_4: UCI ID: 26 (N = 67557, p = 43, Cat = 43, Cont = 0)
- letter_recognition: UCI ID: 59 (N = 20000, p = 17, Cat = 1, Cont = 16)
- magic_gamma_telescope: UCI ID: 159 (N = 19020, p = 11, Cat = 1, Cont = 10)
- statlog_landsat_satellite: UCI ID: 146 (N = 6435, p = 37, Cat = 0, Cont = 37)
- diabetes: Kaggle: mathchi/diabetes-data-set (N = 768, p = 9, Cat = 0, Cont = 9)
- diabetes_HI: Kaggle: alexteboul/diabetes-health-indicators-dataset (N = 253680, p = 22, Cat = 0, Cont = 22)
- diamonds: Kaggle: shivam2503/diamonds (N = 53940, p = 10, Cat = 3, Cont = 7)
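To illustrate how the four summary numbers (N, p, Cat, Cont) relate to a table, here is a minimal sketch on made-up toy data. It assumes that a feature counts as numerical whenever all of its values parse as numbers; this is only a heuristic, and the repository may classify features differently (for example, for integer-coded categories). The helper names are hypothetical.

```python
import csv
import io


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize(rows):
    """Compute N, p, Cat and Cont for a list of dict rows (one dict per instance)."""
    columns = list(rows[0].keys()) if rows else []
    # Assumption: a feature is numerical if every value parses as a number.
    cont = sum(all(is_number(row[col]) for row in rows) for col in columns)
    return {"N": len(rows), "p": len(columns), "Cat": len(columns) - cont, "Cont": cont}


# Toy table with one categorical and two numerical columns (made-up values).
toy_csv = "color,carat,price\nE,0.23,326\nI,0.29,334\nJ,0.31,335\n"
rows = list(csv.DictReader(io.StringIO(toy_csv)))
print(summarize(rows))  # {'N': 3, 'p': 3, 'Cat': 1, 'Cont': 2}
```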
An overview of the datasets can be found in the data/datasets_overview.csv file. Additionally, we provide correlation and histogram plots of the real and synthetic data distributions in the corresponding subfolders data/[DATASET_NAME]/histograms/ and data/[DATASET_NAME]/correlations/.
Note: Since synthesizing the data is computationally expensive, we provide the synthesized datasets in the data/ folder. The scripts are provided for reproducibility purposes. However, we leave the correct installation and setup required for running the scripts to the user.
- synthesize.sh: Bash script for running the data synthesis scripts from the command line. It calls the R script synthesize_non_dl.R and the Python script synthesize_dl.py. Warning: This script may take a long time to run, depending on the number of datasets and generative models. Additionally, the script requires the correct installation of the necessary R and Python packages and environments.
- synthesize_non_dl.R: Script for synthesizing data using non-deep learning generative models in R, i.e., ARF and synthpop (both tree-based).
- synthesize_dl.py: Script for synthesizing data using deep learning generative models in Python, i.e., TabSyn, CTGAN, CTAB-GAN+ and TVAE.
Note: Since fitting the detection models is computationally expensive, especially the xgboost models, which involve Bayesian optimization, we provide the fitted models in the models/ folder for all test versions of the datasets adult_complete and nursery, as those are the ones analyzed in the paper. The scripts are provided for reproducibility purposes. However, due to the time-limited Bayesian optimization, the xgboost models are not exactly reproducible.
fit_models.R: Script for fitting detection models on the real and synthetic datasets. The script fits a Random Forest (using ranger), a Logistic Regression model, and an XGBoost model (using xgboost) on the selected datasets. The models are saved in the models/ folder.
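Conceptually, a detection model is a binary classifier that tries to tell real instances apart from synthetic ones. The following is a minimal sketch of how such a labelled dataset could be assembled and split. The function name, the label coding (0 = real, 1 = synthetic), and the split ratio are illustrative assumptions and may differ from what fit_models.R actually does.

```python
import random


def make_detection_data(real_rows, syn_rows, test_fraction=0.25, seed=0):
    """Label rows by origin, shuffle them, and split into train and test parts."""
    # Assumed label coding: 0 = real instance, 1 = synthetic instance.
    labelled = [(row, 0) for row in real_rows] + [(row, 1) for row in syn_rows]
    rng = random.Random(seed)
    rng.shuffle(labelled)
    n_test = int(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]


# Made-up toy rows standing in for a real and a synthetic dataset.
real = [[0.1 * i, "a"] for i in range(10)]
syn = [[0.1 * i + 0.05, "b"] for i in range(10)]
train, test = make_detection_data(real, syn)
print(len(train), len(test))  # 15 5
```

A classifier fitted on the training part can then be evaluated on the held-out part; the better it separates the two sources, the easier the synthetic data is to detect.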
The following scripts are used to apply IML methods to the fitted detection models. The results are saved in the results/ folder and visualized by running the script create_figures.R.
See details in the corresponding scripts:
- run_pfi.R: Permutation Feature Importance (PFI) method (used for Q1).
- run_feat_effects.R: Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots (used for Q2).
- run_intershap.R: Interaction Shapley values using TreeSHAP with xgboost models (used for Q1 and Q3).
- run_condshape.R: Conditional Shapley values using TreeSHAP with xgboost models (used for Q3).
- run_ce.R, run_ce_extra*.R: Counterfactual Explanations (used for Q4).
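To give an idea of the first method in the list, here is a minimal sketch of permutation feature importance on toy data. It measures the mean drop in accuracy after shuffling one feature column at a time. The toy detector, the data, and the choice of accuracy as performance measure are assumptions for illustration; run_pfi.R may compute the importance differently (for example, with another loss function).

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is pure noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_detector(row):
    """Stand-in for a fitted detection model (hypothetical)."""
    return int(row[0] > 0.5)


pfi = permutation_importance(toy_detector, X, y)
print(pfi)
```

In this toy setting, the importance of feature 1 is exactly zero because the detector never uses it, while shuffling feature 0 reduces the accuracy.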
The figures in the paper can be reproduced by running the following script:
create_figures.R: Script for creating the figures used in the paper. The script reads the results from the results/ folder and generates the figures in the figures/ folder.
For synthesizers:
arf
synthpop
For detection models:
ranger
xgboost
ParBayesianOptimization
Metrics
For IML:
iml
mcceR (installed with remotes::install_github("NorskRegnesentral/mcceR"))
shapr
For visualization:
ggplot2
cowplot
patchwork
flextable
shapviz
xtable
rsvg
svglite
gggenes
ggrepel
ggfittext
For parallel processing:
doParallel
parallelly
parallel
foreach
For data processing and console output:
data.table
rlang
cli
json
tabsyn
CTABGANPlus
sdv