bips-hb/XAI_syn_data_detection

What’s Wrong with Your Synthetic Tabular Data?

This repository contains the code and material to reproduce the results of the manuscript "What’s Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models". The paper is currently under review.

🗂️ Datasets

  • data/: Directory for storing the real and synthetic datasets. This folder contains one subfolder per dataset name, each holding the real dataset (e.g., data/adult_complete/real/) and the synthetic datasets generated by the different generative models (e.g., data/adult_complete/syn/synthpop/). As the synthesis of the datasets is computationally expensive, we already provide 10 synthetic dataset versions for each real dataset.

Used datasets:

('N' = Number of instances, 'p' = Number of features, 'Cat' = number of categorical features, 'Cont' = number of numerical features)

  • adult_complete UCI ID: 2 (N = 47876, p = 14, Cat = 8, Cont = 6)
    Note: In comparison with the original dataset, we renamed some variables and classes, dropped the variable education because of redundancy with education_num, and reduced the number of classes in the variable native-country to the top 30 most frequent countries.
  • nursery UCI ID: 76 (N = 12958, p = 9, Cat = 9, Cont = 0)
    Note: We removed all instances with the class recommend in the variable class.
  • car_evaluation UCI ID: 19 (N = 1728, p = 7, Cat = 7, Cont = 0)
  • chess_king_rook_vs_king UCI ID: 23 (N = 28056, p = 7, Cat = 4, Cont = 3)
  • connect_4 UCI ID: 26 (N = 67557, p = 43, Cat = 43, Cont = 0)
  • letter_recognition UCI ID: 59 (N = 20000, p = 17, Cat = 1, Cont = 16)
  • magic_gamma_telescope UCI ID: 159 (N = 19020, p = 11, Cat = 1, Cont = 10)
  • statlog_landsat_satellite UCI ID: 146 (N = 6435, p = 37, Cat = 0, Cont = 37)
  • diabetes Kaggle: mathchi/diabetes-data-set (N = 768, p = 9, Cat = 0, Cont = 9)
  • diabetes_HI Kaggle: alexteboul/diabetes-health-indicators-dataset (N = 253680, p = 22, Cat = 0, Cont = 22)
  • diamonds Kaggle: shivam2503/diamonds (N = 53940, p = 10, Cat = 3, Cont = 7)

An overview of the datasets can be found in the data/datasets_overview.csv file. Additionally, we provide histogram and correlation plots of the real and synthetic data distributions in the corresponding subfolders data/[DATASET_NAME]/histograms/ and data/[DATASET_NAME]/correlations/.
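The directory layout described above can be sketched as a few path helpers. This is only an illustration: the function names (`real_dir`, `syn_dir`, `plot_dirs`) are hypothetical and not part of the repository, and the file names inside each folder are deliberately left out because they are not documented here.

```python
from pathlib import PurePosixPath


def real_dir(dataset, root="data"):
    """Folder holding the real data of one dataset, e.g. data/adult_complete/real."""
    return PurePosixPath(root) / dataset / "real"


def syn_dir(dataset, model, root="data"):
    """Folder holding the synthetic data of one generative model for one dataset."""
    return PurePosixPath(root) / dataset / "syn" / model


def plot_dirs(dataset, root="data"):
    """Folders holding the histogram and correlation plots of one dataset."""
    base = PurePosixPath(root) / dataset
    return {
        "histograms": base / "histograms",
        "correlations": base / "correlations",
    }


# Hypothetical usage with names taken from the examples in this README.
adult_real = real_dir("adult_complete")
adult_synthpop = syn_dir("adult_complete", "synthpop")
adult_plots = plot_dirs("adult_complete")
```
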

🛠️ Data Synthesis/Generation

Note: Since it is computationally expensive to synthesize the data, we provide the synthesized datasets in the data/ folder. The scripts are provided for reproducibility purposes. However, we leave the correct installation and setup for running the scripts to the user.

  • synthesize.sh: Bash script for running the data synthesis scripts from the command line. It calls the R script synthesize_non_dl.R and the Python script synthesize_dl.py. Warning: This script may take a long time to run, depending on the number of datasets and generative models. Additionally, the script requires the correct installation of the necessary R and Python packages and environments.

  • synthesize_non_dl.R: Script for synthesizing data using non-deep learning generative models in R, i.e., ARF and synthpop (both tree-based).

  • synthesize_dl.py: Script for synthesizing data using deep learning generative models in Python, i.e., TabSyn, CTGAN, CTAB-GAN+ and TVAE.

🧠 Fit detection models

Note: Since it is computationally expensive to fit the detection models, especially the xgboost models, which involve Bayesian optimization, we provide the fitted models in the models/ folder for all test versions of the datasets adult_complete and nursery, as those are analyzed in the paper. The scripts are provided for reproducibility purposes. However, because the Bayesian optimization is time-limited, the xgboost models are not exactly reproducible.

  • fit_models.R: Script for fitting detection models on the real and synthetic datasets. The script fits a Random Forest (using ranger), a Logistic Regression model and an XGBoost model (using xgboost) on the selected datasets. The models are saved in the models/ folder.
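As a rough illustration of the detection setup, the sketch below builds a binary classification problem from real and synthetic rows. It is not the code in fit_models.R: the function name `make_detection_data`, the label encoding (0 = real, 1 = synthetic) and the toy rows are assumptions made for this example only.

```python
import random


def make_detection_data(real_rows, syn_rows, seed=0):
    """Combine real and synthetic rows into one labeled, shuffled dataset.

    Assumed encoding for this sketch: label 0 = real, label 1 = synthetic.
    """
    labeled = [(row, 0) for row in real_rows] + [(row, 1) for row in syn_rows]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    X = [row for row, _ in labeled]
    y = [label for _, label in labeled]
    return X, y


# Toy rows (made up): one synthetic row has an implausible negative age.
real = [{"age": 30}, {"age": 45}, {"age": 52}]
syn = [{"age": 31}, {"age": -4}, {"age": 47}]

X, y = make_detection_data(real, syn)
```

A classifier trained on `X` and `y` then tries to tell the two sources apart. If it performs no better than chance on held-out data, the synthetic data is hard to distinguish from the real data with that model.
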

📊 IML for Synthetic Data Detection

The following scripts are used to apply IML methods to the fitted detection models. The results are saved in the folder results/ and visualized by running the script create_figures.R. See details in the corresponding scripts:

  • run_pfi.R: Permutation Feature Importance (PFI) method (used for Q1).

  • run_feat_effects.R: Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) plots (used for Q2).

  • run_intershap.R: Interaction Shapley values using TreeSHAP with xgboost models (used for Q1 and Q3).

  • run_condshape.R: Conditional Shapley values using TreeSHAP with xgboost models (used for Q3).

  • run_ce.R, run_ce_extra*.R: Counterfactual Explanations (used for Q4).
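To make the first of these methods concrete, here is a minimal sketch of permutation feature importance for a detection model. It measures the average drop in accuracy after shuffling one feature; run_pfi.R may use a different loss or comparison, and the toy detector, data and function names below are made up for illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling the values of one feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        # Copy each row and overwrite only the permuted feature.
        X_perm = [{**row, feature: value} for row, value in zip(X, column)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats


def toy_detector(row):
    """Toy detector (assumption): flags rows with a negative age as synthetic."""
    return int(row["age"] < 0)


# Made-up rows; label 1 = synthetic, label 0 = real.
X = [
    {"age": 30, "hours": 40},
    {"age": -4, "hours": 38},
    {"age": 45, "hours": 20},
    {"age": -1, "hours": 50},
]
y = [0, 1, 0, 1]

pfi_age = permutation_importance(toy_detector, X, y, "age")
pfi_hours = permutation_importance(toy_detector, X, y, "hours")
```

Because the toy detector ignores `hours`, shuffling that feature cannot change its predictions, so its importance is zero. A feature with a large importance is one the detector relies on to separate real from synthetic rows.
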

🚀 Reproducing the Figures

The figures in the paper can be reproduced by running the following script:

  • create_figures.R: Script for creating the figures used in the paper. The script reads the results from the results/ folder and generates the figures in the figures/ folder.

📚 Requirements

Required R packages

For synthesizers:

  • arf
  • synthpop

For detection models:

  • ranger
  • xgboost
  • ParBayesianOptimization
  • Metrics

For IML:

  • iml
  • mcceR (installed with remotes::install_github("NorskRegnesentral/mcceR"))
  • shapr

For visualization:

  • ggplot2
  • cowplot
  • patchwork
  • flextable
  • shapviz
  • xtable
  • rsvg
  • svglite
  • gggenes
  • ggrepel
  • ggfittext

For parallel processing:

  • doParallel
  • parallelly
  • parallel
  • foreach

For data processing and console output:

  • data.table
  • rlang
  • cli

Required Python packages for synthesizers

  • json
  • tabsyn
  • CTABGANPlus
  • sdv
