`ssbgm`: Scikit-learn-based Score Based Generative Model

ssbgm is a Python library that generates synthetic data with a score-based generative model built on scikit-learn-style estimators.

You can also use ssbgm to predict a target value from features by generating synthetic target data conditioned on those features.

Installation

Requirements

Python (>= 3.10)
libraries:
- catboost>=1.2.7
- lightgbm>=4.5.0
- numpy>=1.26.4
- scikit-learn>=1.5.2
- tqdm>=4.67.0
- types-tqdm>=4.66.0.20240417

See ./pyproject.toml for more details.

How to Install

You can install ssbgm via pip:

pip install git+https://github.com/hmasdev/ssbgm.git

or

git clone https://github.com/hmasdev/ssbgm.git
cd ssbgm
pip install .
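
To confirm that the installation succeeded, a quick check along the following lines can help. This is a minimal sketch: it assumes the installed distribution is named `ssbgm` (see ./pyproject.toml if the lookup returns nothing), and `installed_version` is a hypothetical helper, not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name matches the import name, "ssbgm".
print(installed_version("ssbgm"))
```

If this prints `None`, the package is not visible to the interpreter you are running, which usually means pip installed it into a different environment.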

Usage

Generate Synthetic Data

Here is an example of generating synthetic data using ssbgm:

import numpy as np
from sklearn.linear_model import LinearRegression
from ssbgm import ScoreBasedGenerator

# Prepare the dataset that the generator should learn from
# row: sample, column: output dimension
X: np.ndarray = ...

# initialize the generator with LinearRegression
generator = ScoreBasedGenerator(estimator=LinearRegression())

# fit the generator
generator.fit(X)

# generate synthetic data with three different sampling methods
# Langevin Monte Carlo
X_syn_lmc = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.LANGEVIN_MONTECARLO, alpha=0.2).squeeze()
# Euler
X_syn_euler = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()
# Euler-Maruyama
X_syn_em = generator.sample(n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER_MARUYAMA).squeeze()
# The shape of each X_syn_* is (128, X.shape[1])
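
For intuition about what a Langevin Monte Carlo sampler does with a score function, here is a minimal NumPy sketch. It is not ssbgm's implementation: it uses the analytic score of a standard normal instead of a fitted estimator, and `gaussian_score`, `langevin_sample`, and the interpretation of `alpha` as the step size are assumptions made for illustration only.

```python
import numpy as np


def gaussian_score(x: np.ndarray) -> np.ndarray:
    """Score (gradient of the log-density) of a standard normal: -x."""
    return -x


def langevin_sample(score, n_samples, n_dim, alpha=0.1, n_steps=1000, seed=0):
    """Unadjusted Langevin dynamics.

    Each step applies x <- x + alpha * score(x) + sqrt(2 * alpha) * noise.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary starting points; the chain forgets them after enough steps.
    x = rng.uniform(-3.0, 3.0, size=(n_samples, n_dim))
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + alpha * score(x) + np.sqrt(2.0 * alpha) * noise
    return x


samples = langevin_sample(gaussian_score, n_samples=5000, n_dim=2)
print(samples.shape)         # (5000, 2)
print(samples.mean(axis=0))  # close to 0
print(samples.std(axis=0))   # close to 1, with a small bias from the step size
```

In a score-based generative model the analytic score is replaced by a regressor trained to approximate it, which is the role the scikit-learn estimator plays in the example above.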

Conditional Generation

You can use ssbgm to predict a target value with some features by generating synthetic data given the features.

import numpy as np
from sklearn.linear_model import LinearRegression
from ssbgm import ScoreBasedGenerator

# Prepare the dataset that the generator should learn from
# row: sample, column: features
X: np.ndarray = ...
# row: sample, column: target value
y: np.ndarray = ...

# initialize the generator with LinearRegression
generator = ScoreBasedGenerator(estimator=LinearRegression())

# fit the generator
generator.fit(X, y)

# predict the target value on X
y_pred_by_mean, y_pred_std = generator.predict(X, aggregate='mean', return_std=True)  # Shape: (X.shape[0], y.shape[1]), (X.shape[0], y.shape[1])
y_pred_by_median = generator.predict(X, aggregate='median')  # Shape: (X.shape[0], y.shape[1])

# generate synthetic data conditioned on X with three different sampling methods
# Langevin Monte Carlo
X_syn_lmc = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.LANGEVIN_MONTECARLO, alpha=0.2, n_warmup=1000).squeeze()
# Euler
X_syn_euler = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()
# Euler-Maruyama
X_syn_em = generator.sample(X, n_samples=128, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER_MARUYAMA).squeeze()
# The shape of each X_syn_* is (128, X.shape[0], y.shape[1])
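
The relation between conditional samples and point predictions can be sketched with plain NumPy. The snippet below assumes, as the shapes above suggest, that a prediction is an aggregate of the generated samples over the sample axis; it uses random stand-in data rather than a fitted generator, so it illustrates the idea and does not reproduce the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for conditional samples with shape (n_samples, n_rows, n_targets).
n_samples, n_rows, n_targets = 128, 5, 1
samples = rng.normal(loc=2.0, scale=0.5, size=(n_samples, n_rows, n_targets))

# Aggregate over the sample axis to get one value per input row and target.
y_mean = samples.mean(axis=0)
y_median = np.median(samples, axis=0)
y_std = samples.std(axis=0)

print(y_mean.shape, y_median.shape, y_std.shape)  # (5, 1) (5, 1) (5, 1)
```

The standard deviation across samples gives a per-row spread, which is the kind of quantity `return_std=True` exposes alongside the mean.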

Examples

In this section, we will see some examples of using ssbgm.

For more details, see the ./samples directory. ./samples/cheatsheet.ipynb is a good starting point.

Mixed Gaussian Distribution

See ./samples/mixed_gaussian_distribution.ipynb for more details.

# import libraries
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
import matplotlib.pyplot as plt  # matplotlib must be installed separately
import numpy as np

# Make the local ssbgm package importable when running from ./samples
import sys
sys.path.append('../')
from ssbgm import ScoreBasedGenerator

np.random.seed(0)
N = 10000

# Case: 1d mixed gaussian

# generate a training dataset
x_train = np.random.randn(N) + (2*(np.random.rand(N) > 0.5) - 1) * 1.6

# train a generative model with score-based model
generative_model_1d_mixed_gaussian = ScoreBasedGenerator(LGBMRegressor(random_state=42)).fit(x_train, noise_strengths=np.sqrt(np.logspace(-3, np.log(x_train.var()), 101)))

# generate samples from the trained model
x_gen = generative_model_1d_mixed_gaussian.sample(n_samples=N, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()

# plot the results
true_pdf = lambda x: 0.5*np.exp(-0.5*(x-1.6)**2)/np.sqrt(2*np.pi) + 0.5*np.exp(-0.5*(x+1.6)**2)/np.sqrt(2*np.pi)
plt.hist(x_train, bins=30, label='train data', color='blue', alpha=0.5, density=True)
plt.hist(x_gen, bins=30, label='generated data', color='red', alpha=0.5, density=True)
plt.plot(np.linspace(x_train.min(), x_train.max()), true_pdf(np.linspace(x_train.min(), x_train.max())), 'k-', label='true pdf')
plt.legend(loc='upper left')
plt.show()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004026 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Info] Total Bins 357
[LightGBM] [Info] Number of data points in the train set: 1010000, number of used features: 2
[LightGBM] [Info] Start training from score 0.000087
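
The LightGBM log reports 1010000 training points and 2 features for 10000 one-dimensional samples and 101 noise strengths. That is consistent with a denoising-score-matching setup in which every sample is perturbed once per noise level and the regressor sees the noisy value together with the noise strength. The sketch below builds such a dataset; it is an assumption about the general technique, not a copy of ssbgm's internals, and `make_dsm_dataset` is a hypothetical helper.

```python
import numpy as np


def make_dsm_dataset(x, noise_strengths, seed=0):
    """Build a denoising-score-matching regression set from 1-D data.

    Features: (noisy sample, noise strength).
    Target: -eps / sigma, the score of the Gaussian perturbation kernel
    evaluated at the noisy sample.
    """
    rng = np.random.default_rng(seed)
    x_rep = np.repeat(x, len(noise_strengths))  # each sample once per level
    sigma = np.tile(noise_strengths, len(x))    # the matching noise level
    eps = rng.standard_normal(x_rep.shape)
    x_noisy = x_rep + sigma * eps
    features = np.column_stack([x_noisy, sigma])
    target = -eps / sigma
    return features, target


x = np.random.default_rng(1).standard_normal(10_000)
sigmas = np.sqrt(np.logspace(-3, 0, 101))
features, target = make_dsm_dataset(x, sigmas)
print(features.shape, target.shape)  # (1010000, 2) (1010000,)
```

Regressing `target` on `features` yields a model of the score at every noise level, which a sampler can then follow from high noise down to low noise.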

# Case: 2d mixed gaussian

# generate a training dataset
X_train = np.random.randn(N, 2)
label = 2*(np.random.rand(N) > 0.5) - 1
X_train[:, 0] = X_train[:, 0] + label * 1.6
X_train[:, 1] = X_train[:, 1] + label * 1.6

# train a generative model with score-based model
generative_model_2d_mixed_gaussian = ScoreBasedGenerator(
    estimator=CatBoostRegressor(
        verbose=0,
        loss_function='MultiRMSE',
        random_state=42,
    )
)
generative_model_2d_mixed_gaussian.fit(
    X_train,
    noise_strengths=np.sqrt(np.logspace(-3, np.log(max(np.var(X_train, axis=0))), 11)),
)

# generate samples from the trained model
X_gen = generative_model_2d_mixed_gaussian.sample(n_samples=N, sampling_method=ScoreBasedGenerator.SamplingMethod.EULER).squeeze()

# plot the results
true_pdf = lambda X: 0.5*np.exp(-0.5*(X[:, 0]-1.6)**2 - 0.5*(X[:, 1]-1.6)**2)/2/np.pi + 0.5*np.exp(-0.5*(X[:, 0]+1.6)**2 - 0.5*(X[:, 1]+1.6)**2)/2/np.pi
XX_, YY_ = np.meshgrid(np.linspace(X_train[:, 0].min(), X_train[:, 0].max()), np.linspace(X_train[:, 1].min(), X_train[:, 1].max()))
plt.scatter(X_train[:, 0], X_train[:, 1], label='train data', color='blue', alpha=0.2, marker='x')
plt.scatter(X_gen[:, 0], X_gen[:, 1], label='generated data', color='red', alpha=0.2, marker='o')
plt.contourf(XX_, YY_, true_pdf(np.c_[XX_.ravel(), YY_.ravel()]).reshape(XX_.shape), alpha=0.5)
plt.legend(loc='upper left')
plt.xlim(X_train[:, 0].min(), X_train[:, 0].max())
plt.ylim(X_train[:, 1].min(), X_train[:, 1].max())
plt.show()
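
Both cases pass `noise_strengths=np.sqrt(np.logspace(-3, np.log(...), K))`. Since `np.logspace` treats its endpoints as base-10 exponents while `np.log` is the natural logarithm, the largest noise strength is not the data standard deviation. The following sketch prints the endpoints for both choices of logarithm; `data_var` is an illustrative value (the theoretical variance of the 1-D mixture, 1 + 1.6**2), and which endpoint the author intends is not stated here.

```python
import numpy as np

data_var = 3.56  # theoretical variance of the 1-D mixture: 1 + 1.6**2

# np.logspace interprets its endpoints as base-10 exponents.
as_in_examples = np.sqrt(np.logspace(-3, np.log(data_var), 11))
ending_at_data_std = np.sqrt(np.logspace(-3, np.log10(data_var), 11))

# Smallest and largest noise strength for each schedule.
print(as_in_examples[0], as_in_examples[-1])
print(ending_at_data_std[0], ending_at_data_std[-1])
```

With the natural logarithm the schedule tops out above the data standard deviation; with `np.log10` it ends exactly at `np.sqrt(data_var)`.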

How to Develop

Fork the repository: https://github.com/hmasdev/ssbgm

Clone the repository

git clone https://github.com/{YOUR_NAME}/ssbgm
cd ssbgm

Create a virtual environment

python -m venv venv
source venv/bin/activate

Install the required packages
```
pip install -e .[dev]
```
Checkout your working branch
```
git checkout -b your-working-branch
```
Make your changes

Test your changes

pytest
flake8 ssbgm tests
mypy ssbgm tests

Commit your changes

git add .
git commit -m "Your commit message"

Push your changes
```
git push origin your-working-branch
```
Create a pull request: https://github.com/hmasdev/ssbgm/compare

License

MIT

Author

hmasdev
