
Commit 81dbe33

[ci skip] Add tip to read about TargetEncoder in the documentation (#811) 66af6c1
1 parent 263fa8f commit 81dbe33

5 files changed: +141 −18 lines

_sources/python_scripts/03_categorical_pipeline_ex_02.py (+45)

@@ -110,3 +110,48 @@
 
 # %%
 # Write your code here.
+
+# %% [markdown]
+# ### Analysis
+#
+# From an accuracy point of view, the result is almost exactly the same. The
+# reason is that `HistGradientBoostingClassifier` is expressive and robust
+# enough to deal with the misleading ordering of integer-coded categories
+# (which was not the case for linear models).
+#
+# However, from a computational point of view, the training time is much longer:
+# this is caused by the fact that `OneHotEncoder` generates more features than
+# `OrdinalEncoder`; a column is created for each unique categorical value.
+#
+# Note that the current implementation of `HistGradientBoostingClassifier` is
+# still incomplete, and once sparse representations are handled correctly,
+# training time might improve with such kinds of encodings.
+#
+# The main take-away message is that arbitrary integer coding of categories is
+# perfectly fine for `HistGradientBoostingClassifier` and yields fast training
+# times.
+
+# Which encoder should I use?
+#
+# |                  | Meaningful order              | Non-meaningful order                   |
+# | ---------------- | ----------------------------- | -------------------------------------- |
+# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth |
+# | Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`                        |
+
+# %% [markdown]
+# ```{important}
+#
+# - `OneHotEncoder`: always does something meaningful, but can be
+#   unnecessarily slow with trees.
+# - `OrdinalEncoder`: can be detrimental for linear models unless your category
+#   has a meaningful order and you make sure that `OrdinalEncoder` respects
+#   this order. Trees can deal with `OrdinalEncoder` fine as long as they are
+#   deep enough. However, when you allow the decision tree to grow very deep,
+#   it might overfit on other features.
+# ```
+# %% [markdown]
+# Besides one-hot encoding and ordinal encoding of categorical features,
+# scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).
+# This encoder is well suited for nominal categorical features with high
+# cardinality. This encoding strategy is beyond the scope of this course,
+# but the interested reader is encouraged to explore this encoder.

_sources/python_scripts/03_categorical_pipeline_sol_02.py (+16 −7)

@@ -179,8 +179,8 @@
 # not the case for linear models).
 #
 # However, from a computational point of view, the training time is much longer:
-# this is caused by the fact that `OneHotEncoder` generates approximately 10
-# times more features than `OrdinalEncoder`.
+# this is caused by the fact that `OneHotEncoder` generates more features than
+# `OrdinalEncoder`; a column is created for each unique categorical value.
 #
 # Note that the current implementation of `HistGradientBoostingClassifier` is still
 # incomplete, and once sparse representations are handled correctly, training
@@ -190,19 +190,28 @@
 # perfectly fine for `HistGradientBoostingClassifier` and yields fast training
 # times.
 
-# %% [markdown] tags=["solution"]
-# ```{important}
-# Which encoder should I use?
+# %% [markdown]
+# ## Which encoder should I use?
 #
 # |                  | Meaningful order              | Non-meaningful order                   |
 # | ---------------- | ----------------------------- | -------------------------------------- |
-# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder`                       |
+# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth |
 # | Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`                        |
+
+# %% [markdown]
+# ```{important}
 #
 # - `OneHotEncoder`: always does something meaningful, but can be unnecessarily
 #   slow with trees.
 # - `OrdinalEncoder`: can be detrimental for linear models unless your category
 #   has a meaningful order and you make sure that `OrdinalEncoder` respects this
 #   order. Trees can deal with `OrdinalEncoder` fine as long as they are deep
-#   enough.
+#   enough. However, when you allow the decision tree to grow very deep, it
+#   might overfit on other features.
 # ```
+
+# %% [markdown]
+# Besides one-hot encoding and ordinal encoding of categorical features,
+# scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).
+# This encoder is well suited for nominal categorical features with high
+# cardinality. This encoding strategy is beyond the scope of this course,
+# but the interested reader is encouraged to explore this encoder.

python_scripts/03_categorical_pipeline_ex_02.html (+61 −2)

@@ -695,7 +695,10 @@ <h2> Contents </h2>
 <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reference-pipeline-no-numerical-scaling-and-integer-coded-categories">Reference pipeline (no numerical scaling and integer-coded categories)</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-numerical-features">Scaling numerical features</a></li>
-<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a><ul class="nav section-nav flex-column">
+<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#analysis">Analysis</a></li>
+</ul>
+</li>
 </ul>
 </nav>
 </div>
@@ -826,6 +829,59 @@ <h2>One-hot encoding of categorical variables<a class="headerlink" href="#one-ho
 </div>
 </div>
 </div>
+<section id="analysis">
+<h3>Analysis<a class="headerlink" href="#analysis" title="Link to this heading">#</a></h3>
+<p>From an accuracy point of view, the result is almost exactly the same. The
+reason is that <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is expressive and robust
+enough to deal with the misleading ordering of integer-coded categories (which
+was not the case for linear models).</p>
+<p>However, from a computational point of view, the training time is much longer:
+this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates more features than
+<code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>; a column is created for each unique categorical value.</p>
+<p>Note that the current implementation of <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is still
+incomplete, and once sparse representations are handled correctly, training
+time might improve with such kinds of encodings.</p>
+<p>The main take-away message is that arbitrary integer coding of categories is
+perfectly fine for <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> and yields fast training
+times.</p>
+<p>Which encoder should I use?</p>
+<div class="pst-scrollable-table-container"><table class="table">
+<thead>
+<tr class="row-odd"><th class="head"><p></p></th>
+<th class="head"><p>Meaningful order</p></th>
+<th class="head"><p>Non-meaningful order</p></th>
+</tr>
+</thead>
+<tbody>
+<tr class="row-even"><td><p>Tree-based model</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with reasonable depth</p></td>
+</tr>
+<tr class="row-odd"><td><p>Linear model</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with caution</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code></p></td>
+</tr>
+</tbody>
+</table>
+</div>
+<div class="admonition important">
+<p class="admonition-title">Important</p>
+<ul class="simple">
+<li><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code>: always does something meaningful, but can be unnecessarily
+slow with trees.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: can be detrimental for linear models unless your category
+has a meaningful order and you make sure that <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respects this
+order. Trees can deal with <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> fine as long as they are deep
+enough. However, when you allow the decision tree to grow very deep, it might
+overfit on other features.</p></li>
+</ul>
+</div>
+<p>Besides one-hot encoding and ordinal encoding of categorical features,
+scikit-learn offers the <a class="reference external" href="https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder"><code class="docutils literal notranslate"><span class="pre">TargetEncoder</span></code></a>.
+This encoder is well suited for nominal categorical features with high
+cardinality. This encoding strategy is beyond the scope of this course,
+but the interested reader is encouraged to explore this encoder.</p>
+</section>
 </section>
 </section>
 
@@ -895,7 +951,10 @@ <h2>One-hot encoding of categorical variables<a class="headerlink" href="#one-ho
 <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reference-pipeline-no-numerical-scaling-and-integer-coded-categories">Reference pipeline (no numerical scaling and integer-coded categories)</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-numerical-features">Scaling numerical features</a></li>
-<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a><ul class="nav section-nav flex-column">
+<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#analysis">Analysis</a></li>
+</ul>
+</li>
 </ul>
 </nav></div>

python_scripts/03_categorical_pipeline_sol_02.html (+18 −8)

@@ -702,6 +702,7 @@ <h2> Contents </h2>
 <li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#id1">Analysis</a></li>
 </ul>
 </li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#which-encoder-should-i-use">Which encoder should I use?</a></li>
 </ul>
 </nav>
 </div>
@@ -912,17 +913,18 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 enough to deal with the misleading ordering of integer-coded categories (which was
 not the case for linear models).</p>
 <p>However, from a computational point of view, the training time is much longer:
-this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates approximately 10
-times more features than <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>.</p>
+this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates more features than
+<code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>; a column is created for each unique categorical value.</p>
 <p>Note that the current implementation of <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is still
 incomplete, and once sparse representations are handled correctly, training
 time might improve with such kinds of encodings.</p>
 <p>The main take-away message is that arbitrary integer coding of categories is
 perfectly fine for <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> and yields fast training
 times.</p>
-<div class="admonition important">
-<p class="admonition-title">Important</p>
-<p>Which encoder should I use?</p>
+</section>
+</section>
+<section id="which-encoder-should-i-use">
+<h2>Which encoder should I use?<a class="headerlink" href="#which-encoder-should-i-use" title="Link to this heading">#</a></h2>
 <div class="pst-scrollable-table-container"><table class="table">
 <thead>
 <tr class="row-odd"><th class="head"><p></p></th>
@@ -933,7 +935,7 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 <tbody>
 <tr class="row-even"><td><p>Tree-based model</p></td>
 <td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
-<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with reasonable depth</p></td>
 </tr>
 <tr class="row-odd"><td><p>Linear model</p></td>
 <td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with caution</p></td>
@@ -942,16 +944,23 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 </tbody>
 </table>
 </div>
+<div class="admonition important">
+<p class="admonition-title">Important</p>
 <ul class="simple">
 <li><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code>: always does something meaningful, but can be unnecessarily
 slow with trees.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: can be detrimental for linear models unless your category
 has a meaningful order and you make sure that <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respects this
 order. Trees can deal with <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> fine as long as they are deep
-enough.</p></li>
+enough. However, when you allow the decision tree to grow very deep, it might
+overfit on other features.</p></li>
 </ul>
 </div>
-</section>
+<p>Besides one-hot encoding and ordinal encoding of categorical features,
+scikit-learn offers the <a class="reference external" href="https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder"><code class="docutils literal notranslate"><span class="pre">TargetEncoder</span></code></a>.
+This encoder is well suited for nominal categorical features with high
+cardinality. This encoding strategy is beyond the scope of this course,
+but the interested reader is encouraged to explore this encoder.</p>
 </section>
 </section>
 
@@ -1028,6 +1037,7 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 <li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#id1">Analysis</a></li>
 </ul>
 </li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#which-encoder-should-i-use">Which encoder should I use?</a></li>
 </ul>
 </nav></div>

searchindex.js (+1 −1)

(Generated file; the diff is not rendered by default.)
