Commit ca28618
Add MeanNormalisationScaler (#806)
* first version of mean normalization
* augment coverage
* changes after review
* add new tests and fix after review
* second update after discussion
* add mean normalization to the docs
* improve docstrings
* divide _params into _mean and _var
* deleted formula from docstring
* add scaling into index
* fix flake8
* Update docs/index.rst (×2, co-authored by Soledad Galli)
* Update feature_engine/scaling/mean_normalization.py (×3, co-authored by Soledad Galli)
* change to dictionaries
* update docs with demo
* fix (×3)
* minor rewording here and there

Co-authored-by: Soledad Galli <[email protected]>
1 parent 3dcc864 commit ca28618

File tree

12 files changed (+588, -0)


README.md (+4)

@@ -68,6 +68,7 @@ Please share your story by answering 1 quick question
 * Datetime Features
 * Time Series
 * Preprocessing
+* Scaling
 * Scikit-learn Wrappers

 ### Imputation Methods
@@ -110,6 +111,9 @@ Please share your story by answering 1 quick question
 * BoxCoxTransformer
 * YeoJohnsonTransformer

+### Variable Scaling methods
+* MeanNormalizationScaler
+
 ### Variable Creation:
 * MathFeatures
 * RelativeFeatures

docs/api_doc/index.rst (+1)

@@ -48,6 +48,7 @@ Other
 :maxdepth: 1

 preprocessing/index
+scaling/index
 wrappers/index

 Pipeline
docs/api_doc/scaling/MeanNormalizationScaler.rst (new file, +6 lines)

MeanNormalizationScaler
=======================

.. autoclass:: feature_engine.scaling.MeanNormalizationScaler
   :members:

docs/api_doc/scaling/index.rst (new file, +12 lines)

.. -*- mode: rst -*-

Scaling
=======

Feature-engine's scaling transformers apply various scaling techniques to
given columns.

.. toctree::
   :maxdepth: 1

   MeanNormalizationScaler

docs/index.rst (+10)

@@ -67,6 +67,7 @@ Feature-engine includes transformers for:
 - Datetime features
 - Time series
 - Preprocessing
+- Scaling

 Feature-engine transformers are fully compatible with scikit-learn. That means that you can assemble Feature-engine
 transformers within a Scikit-learn pipeline, or use them in a grid or random search for hyperparameters.
@@ -296,6 +297,15 @@ types and variable names match.
 - :doc:`api_doc/preprocessing/MatchCategories`: ensures categorical variables are of type 'category'
 - :doc:`api_doc/preprocessing/MatchVariables`: ensures that columns in test set match those in train set

+Scaling:
+~~~~~~~~
+
+Scaling the data can help to balance the impact of all variables on the model, and can improve
+its performance.
+
+- :doc:`api_doc/scaling/MeanNormalizationScaler`: scale variables using mean normalization
+
 Scikit-learn Wrapper:
 ~~~~~~~~~~~~~~~~~~~~~

docs/user_guide/index.rst (+1)

@@ -18,6 +18,7 @@ Transformation
 discretisation/index
 outliers/index
 transformation/index
+scaling/index

 Creation
 --------
docs/user_guide/scaling/MeanNormalizationScaler.rst (new file, +176 lines)

.. _mean_normalization_scaler:

.. currentmodule:: feature_engine.scaling

MeanNormalizationScaler
=======================

:class:`MeanNormalizationScaler()` scales variables using mean normalization. With mean
normalization, we center the distribution around 0 and rescale it to the variable's value
range, so that its values vary between -1 and 1. This is accomplished by subtracting the
mean of the feature and then dividing by its range (i.e., the difference between the
maximum and minimum values).

:class:`MeanNormalizationScaler()` only works with non-constant numerical variables. If a
variable is constant, the scaler will raise an error.
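As a quick illustration of the arithmetic (plain Python, not Feature-engine's implementation), applying the formula by hand to the *Age* values used later in this guide looks like this:

```python
# Mean normalization by hand: x' = (x - mean(x)) / (max(x) - min(x))
ages = [20, 21, 19, 18]

mean = sum(ages) / len(ages)          # 19.5
value_range = max(ages) - min(ages)   # 21 - 18 = 3

scaled = [(x - mean) / value_range for x in ages]
print([round(v, 6) for v in scaled])  # [0.166667, 0.5, -0.166667, -0.5]
```

Note that the result is centered on 0 and bounded by [-1, 1], which is exactly what the scaler produces for this column below.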
Python example
--------------

We'll demonstrate :class:`MeanNormalizationScaler()` with a toy dataset. Let's create it:

.. code:: python

    import pandas as pd
    from feature_engine.scaling import MeanNormalizationScaler

    df = pd.DataFrame.from_dict(
        {
            "Name": ["tom", "nick", "krish", "jack"],
            "City": ["London", "Manchester", "Liverpool", "Bristol"],
            "Age": [20, 21, 19, 18],
            "Height": [1.80, 1.77, 1.90, 2.00],
            "Marks": [0.9, 0.8, 0.7, 0.6],
            "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
        })

    print(df)
The dataset looks like this:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00

We see that the only numerical features in this dataset are **Age**, **Marks**, and **Height**.
We want to scale them using mean normalization.

First, let's make a list with the variable names:

.. code:: python

    vars = [
        'Age',
        'Marks',
        'Height',
    ]

Now, let's set up :class:`MeanNormalizationScaler()`:

.. code:: python

    # set up the scaler
    scaler = MeanNormalizationScaler(variables=vars)

    # fit the scaler
    scaler.fit(df)

The scaler learns the mean and value range of every column in *vars*. We can access these
learned parameters in the following way:

.. code:: python

    # access the parameters learned by the scaler
    print(f'Means: {scaler.mean_}')
    print(f'Ranges: {scaler.range_}')

We see the features' means and value ranges in the following output:

.. code:: python

    Means: {'Age': 19.5, 'Marks': 0.7500000000000001, 'Height': 1.8675000000000002}
    Ranges: {'Age': 3.0, 'Marks': 0.30000000000000004, 'Height': 0.22999999999999998}

We can now go ahead and scale the variables:

.. code:: python

    # scale the data
    df = scaler.transform(df)
    print(df)

In the following output, we can see the scaled variables:

.. code:: python

        Name        City       Age    Height     Marks                 dob
    0    tom      London  0.166667 -0.293478  0.500000 2020-02-24 00:00:00
    1   nick  Manchester  0.500000 -0.423913  0.166667 2020-02-24 00:01:00
    2  krish   Liverpool -0.166667  0.141304 -0.166667 2020-02-24 00:02:00
    3   jack     Bristol -0.500000  0.576087 -0.500000 2020-02-24 00:03:00

We can restore the data to its original values using the inverse transformation:

.. code:: python

    # inverse transform the dataframe
    df = scaler.inverse_transform(df)
    print(df)

In the following output, we see the scaled variables returned to their original representation:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00
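The inverse transformation is simply the formula run backwards: multiply by the learned range and add back the mean. A minimal plain-Python sketch, using the values learned for *Age* above (not Feature-engine's implementation):

```python
# Inverse of mean normalization: x = x_scaled * range + mean
mean, value_range = 19.5, 3.0                # parameters learned for "Age"
scaled = [1 / 6, 0.5, -1 / 6, -0.5]          # the scaled Age column

restored = [round(v * value_range + mean, 6) for v in scaled]
print(restored)  # [20.0, 21.0, 19.0, 18.0]
```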
123+
124+
125+
Additional resources
126+
--------------------
127+
128+
For more details about this and other feature engineering methods check out
129+
these resources:
130+
131+
132+
.. figure:: ../../images/feml.png
133+
:width: 300
134+
:figclass: align-center
135+
:align: left
136+
:target: https://www.trainindata.com/p/feature-engineering-for-machine-learning
137+
138+
Feature Engineering for Machine Learning
139+
140+
|
141+
|
142+
|
143+
|
144+
|
145+
|
146+
|
147+
|
148+
|
149+
|
150+
151+
Or read our book:
152+
153+
.. figure:: ../../images/cookbook.png
154+
:width: 200
155+
:figclass: align-center
156+
:align: left
157+
:target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587
158+
159+
Python Feature Engineering Cookbook
160+
161+
|
162+
|
163+
|
164+
|
165+
|
166+
|
167+
|
168+
|
169+
|
170+
|
171+
|
172+
|
173+
|
174+
175+
Both our book and course are suitable for beginners and more advanced data scientists
176+
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.

docs/user_guide/scaling/index.rst (new file, +59 lines)

.. -*- mode: rst -*-
.. _scaling_user_guide:

.. currentmodule:: feature_engine.scaling

Scaling
=======

`Feature scaling <https://www.blog.trainindata.com/feature-scaling-in-machine-learning/>`_
is the process of transforming the range of numerical features so that they fit within a
specific scale, usually to improve the performance and training stability of machine learning
models.

Scaling helps to normalize the input data, ensuring that each feature contributes proportionately
to the final result. It matters particularly for algorithms that are sensitive to the range of the
data, such as gradient descent-based models (e.g., linear regression, logistic regression, neural
networks) and distance-based models (e.g., K-nearest neighbors, clustering).

Feature-engine's scalers replace the variables' values with the scaled ones. On this page, we
discuss the importance of scaling numerical features, and then introduce the various
scaling techniques supported by Feature-engine.

Importance of scaling
---------------------

Scaling is crucial in machine learning because it ensures that features contribute equally to
model training, preventing bias toward variables with larger ranges. Properly scaled data enhances
the performance of algorithms that are sensitive to the magnitude of input values, such as
gradient descent and distance-based methods. Additionally, scaling can improve convergence speed
and overall model accuracy, leading to more reliable predictions.
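To see why magnitude matters for distance-based methods, here is a small plain-Python sketch (hypothetical salary and age values, not from Feature-engine's docs): without scaling, the Euclidean distance between two observations is dominated entirely by the feature with the larger range.

```python
import math

# Two observations with features on very different scales
salary = [30_000.0, 32_000.0]   # range: 2,000
age = [25.0, 60.0]              # range: 35

# Unscaled distance: the salary difference drowns out the age difference
d_raw = math.dist((salary[0], age[0]), (salary[1], age[1]))

def mean_normalize(values):
    """Scale values to [-1, 1] via (x - mean) / (max - min)."""
    m = sum(values) / len(values)
    return [(v - m) / (max(values) - min(values)) for v in values]

# After mean normalization both features contribute comparably
s, a = mean_normalize(salary), mean_normalize(age)
d_scaled = math.dist((s[0], a[0]), (s[1], a[1]))

print(round(d_raw, 1), round(d_scaled, 3))  # 2000.3 1.414
```

After scaling, both feature differences are 1.0, so each contributes equally to the distance.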
When to apply scaling
---------------------

- **Training:** Most machine learning algorithms require data to be scaled before training,
  especially linear models, neural networks, and distance-based models.

- **Feature engineering:** Scaling can be essential for certain feature engineering techniques,
  like polynomial features.

- **Resampling:** Some oversampling methods, like SMOTE, and many undersampling methods
  clean data based on KNN algorithms, which are distance-based models.

When scaling is not necessary
-----------------------------

Not all algorithms require scaling. For example, tree-based algorithms (like decision trees,
random forests, and gradient boosting) are generally invariant to scaling because they split
data based on the order of values, not the magnitude.
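This invariance is easy to check: a split threshold depends only on how the observations are ordered, and a monotonic transformation such as mean normalization preserves that order. A small illustrative sketch (plain Python, hypothetical values):

```python
raw = [18, 21, 19, 20]
scaled = [(x - 19.5) / 3 for x in raw]  # mean normalization: mean=19.5, range=3

# The ranking of observations is identical before and after scaling,
# so a tree would choose the same partitions of the data
rank = lambda xs: sorted(range(len(xs)), key=xs.__getitem__)
print(rank(raw) == rank(scaled))  # True
```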
Scalers
-------

.. toctree::
   :maxdepth: 1

   MeanNormalizationScaler

feature_engine/scaling/__init__.py (new file, +10 lines)

.. code:: python

    """
    The module scaling includes classes to transform variables using various
    scaling methods.
    """

    from .mean_normalization import MeanNormalizationScaler

    __all__ = [
        "MeanNormalizationScaler",
    ]
