GitHub - UBC-MDS/PrepPy

PrepPy

Package Summary

PrepPy is a Python package that helps with preprocessing in machine learning tasks. Certain repetitive tasks come up in almost every machine learning project, and this package aims to alleviate those chores. Some of the tasks that come up regularly are: finding the type of each column in a dataframe, splitting the data (whether into train/test sets or train/validation/test sets), one-hot encoding, and scaling features. This package helps with all of those tasks.

Installation:

pip install -i https://test.pypi.org/simple/ preppy524

Features

This package has the following features:

train_valid_test_split: This function splits the data set into train, validation, and test sets.
data_type: This function identifies data types for each column/feature. It returns one dataframe for each type of data.
onehot: This function performs one-hot encoding on the categorical features and returns a dataframe for each of the train, validation, and test sets with sensible column names.
scaler: This function performs standard scaling on the numerical features.

Dependencies

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

Usage

preppy524.datatype module

The data_type() function identifies features of different data types: numeric or categorical.

Input: Pandas DataFrame
Output: A tuple (Pandas DataFrame of numeric features, Pandas DataFrame of categorical features)

from preppy524 import datatype  
datatype.data_type(my_data)

Example:

my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})

datatype.data_type(my_data)[0]

	count	price
0	3	1.0
1	5	6.5
2	8	9.23

datatype.data_type(my_data)[1]

	fruits
0	apple
1	banana
2	pear
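
For readers curious how such a split can be done with pandas alone, here is a minimal sketch. It is not the package's implementation: split_by_dtype is a hypothetical helper, and it assumes that "categorical" simply means "every column that is not numeric".

```python
import pandas as pd


def split_by_dtype(df):
    """Hypothetical helper: return (numeric columns, all other columns)."""
    # select_dtypes filters columns by dtype; "number" covers ints and floats.
    numeric = df.select_dtypes(include="number")
    # Assumption: anything that is not numeric is treated as categorical.
    categorical = df.select_dtypes(exclude="number")
    return numeric, categorical


my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})

numeric, categorical = split_by_dtype(my_data)
print(list(numeric.columns))      # ['count', 'price']
print(list(categorical.columns))  # ['fruits']
```

Returning a tuple mirrors the (numeric, categorical) output described above, so the two results can be unpacked in one line.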

preppy524.train_valid_test_split module

The train_valid_test_split() function splits dataframes into random train, validation, and test subsets.

Input: Sequence of Pandas DataFrames of the same length / shape[0]
Output: List containing train, validation and test splits of the input data

from preppy524 import train_valid_test_split  
train_valid_test_split.train_valid_test_split(X, y)

Example:

X, y = np.arange(16).reshape((8, 2)), list(range(8))

X_train, X_valid, X_test, y_train, y_valid, y_test = (
    train_valid_test_split.train_valid_test_split(X,
                                                  y,
                                                  test_size=0.25,
                                                  valid_size=0.25,
                                                  random_state=777))
                                                          
y_train

[3, 0, 2, 5]

preppy524.onehot module

The onehot() function encodes features of categorical type.

Input: List of categorical features, Train set, Validation set, Test set (Pandas DataFrames)
Output: Encoded Pandas DataFrame

from preppy524 import onehot
onehot.onehot(cols=['categorical_columns'], train=my_data)

Example:

onehot.onehot(['fruits'], my_data)['train']

	apple	banana	pear
0	1	0	0
1	0	1	0
2	0	0	1
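
The naming step can be illustrated with sklearn alone. The sketch below is an assumption about the general approach, not the package's code: onehot_sketch is a hypothetical function that encodes only a train set and names each output column after its category level, as in the table above. If two input columns share a level, these names would collide, so a real implementation would need to handle that case.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def onehot_sketch(cols, train):
    """Hypothetical helper: one-hot encode `cols` of `train` with readable names."""
    encoder = OneHotEncoder(handle_unknown="ignore")
    # fit_transform returns a sparse matrix by default; convert it to dense.
    encoded = encoder.fit_transform(train[cols]).toarray()
    # categories_ holds one array of levels per input column, in column order.
    names = [str(level) for levels in encoder.categories_ for level in levels]
    return pd.DataFrame(encoded.astype(int), columns=names, index=train.index)


my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})

encoded = onehot_sketch(['fruits'], my_data)
print(encoded)
```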

preppy524.scaler module

The scaler() function performs standard scaling of numeric features.

Input: Train set, Validation set, Test set (Pandas DataFrames), List of numeric features
Output: Dictionary of transformed sets (Pandas DataFrames)

from preppy524 import scaler
scaler.scaler(x_train, x_validation, x_test, colnames)

Example:

scaler.scaler(my_data, my_data, my_data, ['count'])['x_validation']

	count
0	-0.927
1	-0.132
2	1.059
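
A common way to scale three sets consistently is to fit the scaler on the train set only and reuse it for the other two. The sketch below shows that pattern with sklearn's StandardScaler; scale_sketch is a hypothetical function and is not claimed to be the package's implementation. Note that StandardScaler divides by the population standard deviation, so its numbers will differ from the table above, which is consistent with the sample standard deviation instead.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_sketch(x_train, x_validation, x_test, colnames):
    """Hypothetical helper: standard-scale `colnames` using train statistics."""
    # Fit on the train set only, so validation/test data do not leak into
    # the mean and standard deviation.
    fitted = StandardScaler().fit(x_train[colnames])
    sets = {"x_train": x_train, "x_validation": x_validation, "x_test": x_test}
    return {
        name: pd.DataFrame(fitted.transform(frame[colnames]),
                           columns=colnames, index=frame.index)
        for name, frame in sets.items()
    }


my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})

scaled = scale_sketch(my_data, my_data, my_data, ['count'])
print(scaled['x_validation'])
```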

Our package in the Python ecosystem

Most of what this package does can also be done with sklearn directly. However, some tasks take multiple steps in sklearn that this package completes in one line. For example, to split a dataset into train, validation, and test sets, one would need to call sklearn's train_test_split twice. This package's train_valid_test_split does it in a single call. Further, the one-hot encoder in sklearn does not produce sensible column names unless the user does some wrangling. The onehot function in this package uses sklearn's one-hot encoder, but wrangles the columns and names them automatically. Overall, this package fits in well with the Python ecosystem and can help make machine learning a little easier.
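
To make the "twice" concrete, here is a minimal sketch of the two-step sklearn approach. three_way_split is a hypothetical name, and the sketch assumes valid_size is a fraction of the full dataset, which is why it is rescaled before the second call; the package may define it differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_size=0.25, valid_size=0.25, random_state=None):
    """Hypothetical helper: train/validation/test split via two sklearn calls."""
    # First call: carve the test set off the full data.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # Assumption: valid_size refers to the full dataset, so rescale it to a
    # fraction of what remains after removing the test set.
    relative_valid = valid_size / (1 - test_size)
    # Second call: split the remainder into train and validation sets.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_rest, y_rest, test_size=relative_valid, random_state=random_state)
    return X_train, X_valid, X_test, y_train, y_valid, y_test


X, y = np.arange(16).reshape((8, 2)), list(range(8))
X_train, X_valid, X_test, y_train, y_valid, y_test = three_way_split(
    X, y, test_size=0.25, valid_size=0.25, random_state=777)
print(len(y_train), len(y_valid), len(y_test))  # 4 2 2
```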

Documentation

The official documentation is hosted on Read the Docs: https://preppy524.readthedocs.io/en/latest/

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.
