What’s Wrong with Your Synthetic Tabular Data?

This repository contains the code and material to reproduce the results of the manuscript "What’s Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models". The paper is currently under review.

🗂️ Datasets

data/: Directory for storing the real and synthetic datasets. It contains one subfolder per dataset, each holding the real data (e.g., data/adult_complete/real/) and the synthetic data generated by the different generative models (e.g., data/adult_complete/syn/synthpop/). Because synthesizing the datasets is computationally expensive, we already provide 10 synthetic dataset versions for all real datasets.
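The folder layout described above can be explored with a few lines of Python. This is only a sketch: the helper names list_synthetic_dirs and real_dir are hypothetical and not part of the repository, and nothing is assumed about the file names inside the folders.

```python
from pathlib import Path


def real_dir(data_root, dataset):
    """Return the folder that holds the real data, e.g. data/adult_complete/real/."""
    return Path(data_root) / dataset / "real"


def list_synthetic_dirs(data_root, dataset):
    """Map each generative model name to its folder under <dataset>/syn/.

    Returns an empty dict if the dataset has no syn/ folder.
    """
    syn_root = Path(data_root) / dataset / "syn"
    if not syn_root.is_dir():
        return {}
    return {p.name: p for p in sorted(syn_root.iterdir()) if p.is_dir()}


# Example call from the repository root (prints nothing if data/ is missing).
for model_name, folder in list_synthetic_dirs("data", "adult_complete").items():
    print(model_name, "->", folder)
```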

Used datasets:

('N' = number of instances, 'p' = number of features, 'Cat' = number of categorical features, 'Cont' = number of numerical features)

adult_complete UCI ID: 2 (N = 47876, p = 14, Cat = 8, Cont = 6)
Note: In comparison with the original dataset, we renamed some variables and classes, dropped the variable education because of redundancy with education_num, and reduced the number of classes in the variable native-country to the top 30 most frequent countries.
nursery UCI ID: 76 (N = 12958, p = 9, Cat = 9, Cont = 0)
Note: We removed all instances with the class recommend in the variable class.
car_evaluation UCI ID: 19 (N = 1728, p = 7, Cat = 7, Cont = 0)
chess_king_rook_vs_king UCI ID: 23 (N = 28056, p = 7, Cat = 4, Cont = 3)
connect_4 UCI ID: 26 (N = 67557, p = 43, Cat = 43, Cont = 0)
letter_recognition UCI ID: 59 (N = 20000, p = 17, Cat = 1, Cont = 16)
magic_gamma_telescope UCI ID: 159 (N = 19020, p = 11, Cat = 1, Cont = 10)
statlog_landsat_satellite UCI ID: 146 (N = 6435, p = 37, Cat = 0, Cont = 37)
diabetes Kaggle: mathchi/diabetes-data-set (N = 768, p = 9, Cat = 0, Cont = 9)
diabetes_HI Kaggle: alexteboul/diabetes-health-indicators-dataset (N = 253680, p = 22, Cat = 0, Cont = 22)
diamonds Kaggle: shivam2503/diamonds (N = 53940, p = 10, Cat = 3, Cont = 7)

An overview of the datasets can be found in the data/datasets_overview.csv file. Additionally, we provide correlation and histogram plots of the real and synthetic data distributions in the corresponding subfolders data/[DATASET_NAME]/histograms/ and data/[DATASET_NAME]/correlations/.
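As a quick sanity check on the notation, the counts listed above should satisfy p = Cat + Cont for every dataset. The sketch below hard-codes the numbers from the list in this README (it does not read data/datasets_overview.csv, whose column names are not documented here); the function name inconsistent_rows is hypothetical.

```python
# (name, N, p, Cat, Cont), copied from the dataset list in this README.
DATASETS = [
    ("adult_complete", 47876, 14, 8, 6),
    ("nursery", 12958, 9, 9, 0),
    ("car_evaluation", 1728, 7, 7, 0),
    ("chess_king_rook_vs_king", 28056, 7, 4, 3),
    ("connect_4", 67557, 43, 43, 0),
    ("letter_recognition", 20000, 17, 1, 16),
    ("magic_gamma_telescope", 19020, 11, 1, 10),
    ("statlog_landsat_satellite", 6435, 37, 0, 37),
    ("diabetes", 768, 9, 0, 9),
    ("diabetes_HI", 253680, 22, 0, 22),
    ("diamonds", 53940, 10, 3, 7),
]


def inconsistent_rows(rows):
    """Return the names of datasets where Cat + Cont does not equal p."""
    return [name for name, _n, p, cat, cont in rows if cat + cont != p]


print(inconsistent_rows(DATASETS))  # prints [] because every row is consistent
```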

🛠️ Data Synthesis/Generation

Note: Since it is computationally expensive to synthesize the data, we provide the synthesized datasets in the data/ folder. The scripts are provided for reproducibility purposes. However, we leave the correct installation and setup required to run the scripts to the user.

synthesize.sh: Bash script for running the data synthesis scripts from the command line. It calls the R script synthesize_non_dl.R and the Python script synthesize_dl.py. Warning: This script may take a long time to run, depending on the number of datasets and generative models. Additionally, it requires that the necessary R and Python packages and environments are installed correctly.
synthesize_non_dl.R: Script for synthesizing data using non-deep learning generative models in R, i.e., ARF and synthpop (both tree-based).
synthesize_dl.py: Script for synthesizing data using deep learning generative models in Python, i.e., TabSyn, CTGAN, CTAB-GAN+ and TVAE.

🧠 Fit detection models

Note: Since it is computationally expensive to fit the detection models, especially the xgboost models, which involve Bayesian optimization, we provide the fitted models in the models/ folder on all test versions of the datasets adult_complete and nursery, as those are analyzed in the paper. The scripts are provided for reproducibility purposes. However, because the Bayesian optimization is time-limited, the xgboost models are not exactly reproducible.

fit_models.R: Script for fitting detection models on the real and synthetic datasets. The script fits a Random Forest (using ranger), a Logistic Regression model and an XGBoost model (using xgboost) on the selected datasets. The models are saved in the models/ folder.
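To illustrate the idea behind a detection model, the sketch below assumes the task is framed as binary classification of real versus synthetic rows. It only builds the labeled data and a train/test split; it is not taken from fit_models.R, and the function name, the label coding (0 = real, 1 = synthetic) and the split fraction are illustrative assumptions.

```python
import random


def make_detection_data(real_rows, syn_rows, test_frac=0.3, seed=0):
    """Label real rows 0 and synthetic rows 1, shuffle, and split.

    Returns (train, test), each a list of (row, label) pairs.
    """
    labeled = [(row, 0) for row in real_rows] + [(row, 1) for row in syn_rows]
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(labeled)
    n_test = int(round(test_frac * len(labeled)))
    return labeled[n_test:], labeled[:n_test]


# Toy rows standing in for one real and one synthetic dataset.
real = [[1.0, "a"], [2.0, "b"], [3.0, "a"], [4.0, "b"], [5.0, "a"]]
syn = [[1.1, "a"], [2.2, "a"], [2.9, "b"], [4.3, "b"], [5.5, "b"]]
train, test = make_detection_data(real, syn)
print(len(train), len(test))
```

A classifier that separates the two labels well on the test split indicates that the synthetic data differs detectably from the real data.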

📊 IML for Synthetic Data Detection

The following scripts are used to apply IML methods to the fitted detection models. The results are saved in the folder results/ and visualized by running the script create_figures.R. See details in the corresponding scripts:

run_pfi.R: Permutation Feature Importance (PFI) method (used for Q1).
run_feat_effects.R: Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots (used for Q2).
run_intershap.R: Interaction Shapley values using TreeSHAP with xgboost models (used for Q1 and Q3).
run_condshap.R: Conditional Shapley values using TreeSHAP with xgboost models (used for Q3).
run_ce.R, run_ce_extra*.R: Counterfactual Explanations (used for Q4).
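For readers unfamiliar with PFI, the following is a minimal, library-free sketch of the idea: permute one feature at a time and measure how much the model's accuracy drops. It is not the implementation used in run_pfi.R; the function names, the use of accuracy as the score, and the toy model are assumptions made for illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup: the label depends only on the first feature.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    return int(row[0] > 0.5)


pfi = permutation_importance(toy_model, X, y)
print(pfi)  # the second entry is 0.0 because the model ignores that feature
```

In the detection setting, a feature with high importance is one the detection model relies on to tell real from synthetic rows.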

🚀 Reproducing the Figures

The figures in the paper can be reproduced by running the following script:

create_figures.R: Script for creating the figures used in the paper. The script reads the results from the results/ folder and generates the figures in the figures/ folder.

📚 Requirements

Required R packages

For synthesizers:

arf
synthpop

For detection models:

ranger
xgboost
ParBayesianOptimization
Metrics

For IML:

iml
mcceR (installed with remotes::install_github("NorskRegnesentral/mcceR"))
shapr

For visualization:

ggplot2
cowplot
patchwork
flextable
shapviz
xtable
rsvg
svglite
gggenes
ggrepel
ggfittext

For parallel processing:

doParallel
parallelly
parallel
foreach

For data processing and console output:

data.table
rlang
cli

Required Python packages for synthesizers

json (part of the Python standard library, no installation needed)
tabsyn
CTABGANPlus
sdv
