
Commit 81dbe33

[ci skip] Add tip to read about TargetEncoder in the documentation (#811) 66af6c1
1 parent 263fa8f commit 81dbe33

5 files changed: +141 −18 lines

_sources/python_scripts/03_categorical_pipeline_ex_02.py (+45)

@@ -110,3 +110,48 @@
 
 # %%
 # Write your code here.
+
+# %% [markdown]
+# ### Analysis
+#
+# From an accuracy point of view, the result is almost exactly the same. The
+# reason is that `HistGradientBoostingClassifier` is expressive and robust
+# enough to deal with the misleading ordering of integer-coded categories
+# (which was not the case for linear models).
+#
+# However, from a computational point of view, the training time is much longer:
+# this is caused by the fact that `OneHotEncoder` generates more features than
+# `OrdinalEncoder`; a column is created for each unique categorical value.
+#
+# Note that the current implementation of `HistGradientBoostingClassifier` is
+# still incomplete, and once sparse representations are handled correctly,
+# training time might improve with such kinds of encodings.
+#
+# The main take-away message is that arbitrary integer coding of categories is
+# perfectly fine for `HistGradientBoostingClassifier` and yields fast training
+# times.
+
+# Which encoder should I use?
+#
+# |                  | Meaningful order              | Non-meaningful order                   |
+# | ---------------- | ----------------------------- | -------------------------------------- |
+# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth |
+# | Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`                        |
+
+# %% [markdown]
+# ```{important}
+#
+# - `OneHotEncoder`: always does something meaningful, but can be
+#   unnecessarily slow with trees.
+# - `OrdinalEncoder`: can be detrimental for linear models unless your category
+#   has a meaningful order and you make sure that `OrdinalEncoder` respects
+#   this order. Trees can deal with `OrdinalEncoder` fine as long as they are
+#   deep enough. However, when you allow the decision tree to grow very deep,
+#   it might overfit on other features.
+# ```
+# %% [markdown]
+# Besides one-hot encoding and ordinal encoding of categorical features,
+# scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).
+# This encoder is well suited for nominal categorical features with high
+# cardinality. This encoding strategy is beyond the scope of this course,
+# but the interested reader is encouraged to explore this encoder.

_sources/python_scripts/03_categorical_pipeline_sol_02.py (+16 −7)

@@ -179,8 +179,8 @@
 # not the case for linear models).
 #
 # However, from a computational point of view, the training time is much longer:
-# this is caused by the fact that `OneHotEncoder` generates approximately 10
-# times more features than `OrdinalEncoder`.
+# this is caused by the fact that `OneHotEncoder` generates more features than
+# `OrdinalEncoder`; a column is created for each unique categorical value.
 #
 # Note that the current implementation of `HistGradientBoostingClassifier` is still
 # incomplete, and once sparse representations are handled correctly, training
@@ -190,19 +190,28 @@
 # perfectly fine for `HistGradientBoostingClassifier` and yields fast training
 # times.
 
-# %% [markdown] tags=["solution"]
-# ```{important}
-# Which encoder should I use?
+# %% [markdown]
+# ## Which encoder should I use?
 #
 # |                  | Meaningful order              | Non-meaningful order                   |
 # | ---------------- | ----------------------------- | -------------------------------------- |
-# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder`                       |
+# | Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth |
 # | Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`                        |
+
+# %% [markdown]
+# ```{important}
 #
 # - `OneHotEncoder`: always does something meaningful, but can be unnecessarily
 #   slow with trees.
 # - `OrdinalEncoder`: can be detrimental for linear models unless your category
 #   has a meaningful order and you make sure that `OrdinalEncoder` respects this
 #   order. Trees can deal with `OrdinalEncoder` fine as long as they are deep
-#   enough.
+#   enough. However, when you allow the decision tree to grow very deep, it
+#   might overfit on other features.
 # ```
+
+# %% [markdown]
+# Besides one-hot encoding and ordinal encoding of categorical features,
+# scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).
+# This encoder is well suited for nominal categorical features with high
+# cardinality. This encoding strategy is beyond the scope of this course,
+# but the interested reader is encouraged to explore this encoder.

python_scripts/03_categorical_pipeline_ex_02.html (+61 −2)

@@ -695,7 +695,10 @@ <h2> Contents </h2>
 <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reference-pipeline-no-numerical-scaling-and-integer-coded-categories">Reference pipeline (no numerical scaling and integer-coded categories)</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-numerical-features">Scaling numerical features</a></li>
-<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a><ul class="nav section-nav flex-column">
+<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#analysis">Analysis</a></li>
+</ul>
+</li>
 </ul>
 </nav>
 </div>
@@ -826,6 +829,59 @@ <h2>One-hot encoding of categorical variables<a class="headerlink" href="#one-ho
 </div>
 </div>
 </div>
+<section id="analysis">
+<h3>Analysis<a class="headerlink" href="#analysis" title="Link to this heading">#</a></h3>
+<p>From an accuracy point of view, the result is almost exactly the same. The
+reason is that <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is expressive and robust
+enough to deal with the misleading ordering of integer-coded categories (which
+was not the case for linear models).</p>
+<p>However, from a computational point of view, the training time is much longer:
+this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates more features than
+<code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>; a column is created for each unique categorical value.</p>
+<p>Note that the current implementation of <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is still
+incomplete, and once sparse representations are handled correctly, training
+time might improve with such kinds of encodings.</p>
+<p>The main take-away message is that arbitrary integer coding of categories is
+perfectly fine for <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> and yields fast training
+times.</p>
+<p>Which encoder should I use?</p>
+<div class="pst-scrollable-table-container"><table class="table">
+<thead>
+<tr class="row-odd"><th class="head"><p></p></th>
+<th class="head"><p>Meaningful order</p></th>
+<th class="head"><p>Non-meaningful order</p></th>
+</tr>
+</thead>
+<tbody>
+<tr class="row-even"><td><p>Tree-based model</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with reasonable depth</p></td>
+</tr>
+<tr class="row-odd"><td><p>Linear model</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with caution</p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code></p></td>
+</tr>
+</tbody>
+</table>
+</div>
+<div class="admonition important">
+<p class="admonition-title">Important</p>
+<ul class="simple">
+<li><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code>: always does something meaningful, but can be unnecessarily
+slow with trees.</p></li>
+<li><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: can be detrimental for linear models unless your category
+has a meaningful order and you make sure that <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respects this
+order. Trees can deal with <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> fine as long as they are deep
+enough. However, when you allow the decision tree to grow very deep, it might
+overfit on other features.</p></li>
+</ul>
+</div>
+<p>Besides one-hot encoding and ordinal encoding of categorical features,
+scikit-learn offers the <a class="reference external" href="https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder"><code class="docutils literal notranslate"><span class="pre">TargetEncoder</span></code></a>.
+This encoder is well suited for nominal categorical features with high
+cardinality. This encoding strategy is beyond the scope of this course,
+but the interested reader is encouraged to explore this encoder.</p>
+</section>
 </section>
 </section>
 
@@ -895,7 +951,10 @@ <h2>One-hot encoding of categorical variables<a class="headerlink" href="#one-ho
 <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reference-pipeline-no-numerical-scaling-and-integer-coded-categories">Reference pipeline (no numerical scaling and integer-coded categories)</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-numerical-features">Scaling numerical features</a></li>
-<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#one-hot-encoding-of-categorical-variables">One-hot encoding of categorical variables</a><ul class="nav section-nav flex-column">
+<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#analysis">Analysis</a></li>
+</ul>
+</li>
 </ul>
 </nav></div>

python_scripts/03_categorical_pipeline_sol_02.html (+18 −8)

@@ -702,6 +702,7 @@ <h2> Contents </h2>
 <li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#id1">Analysis</a></li>
 </ul>
 </li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#which-encoder-should-i-use">Which encoder should I use?</a></li>
 </ul>
 </nav>
 </div>
@@ -912,17 +913,18 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 enough to deal with the misleading ordering of integer-coded categories (which was
 not the case for linear models).</p>
 <p>However, from a computational point of view, the training time is much longer:
-this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates approximately 10
-times more features than <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>.</p>
+this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates more features than
+<code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>; a column is created for each unique categorical value.</p>
 <p>Note that the current implementation of <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is still
 incomplete, and once sparse representations are handled correctly, training
 time might improve with such kinds of encodings.</p>
 <p>The main take-away message is that arbitrary integer coding of categories is
 perfectly fine for <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> and yields fast training
 times.</p>
-<div class="admonition important">
-<p class="admonition-title">Important</p>
-<p>Which encoder should I use?</p>
+</section>
+</section>
+<section id="which-encoder-should-i-use">
+<h2>Which encoder should I use?<a class="headerlink" href="#which-encoder-should-i-use" title="Link to this heading">#</a></h2>
 <div class="pst-scrollable-table-container"><table class="table">
 <thead>
 <tr class="row-odd"><th class="head"><p></p></th>
@@ -933,7 +935,7 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 <tbody>
 <tr class="row-even"><td><p>Tree-based model</p></td>
 <td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
-<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code></p></td>
+<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with reasonable depth</p></td>
 </tr>
 <tr class="row-odd"><td><p>Linear model</p></td>
 <td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with caution</p></td>
@@ -942,16 +944,23 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 </tbody>
 </table>
 </div>
+<div class="admonition important">
+<p class="admonition-title">Important</p>
 <ul class="simple">
 <li><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code>: always does something meaningful, but can be unnecessarily
 slow with trees.</p></li>
 <li><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: can be detrimental for linear models unless your category
 has a meaningful order and you make sure that <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respects this
 order. Trees can deal with <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> fine as long as they are deep
-enough.</p></li>
+enough. However, when you allow the decision tree to grow very deep, it might
+overfit on other features.</p></li>
 </ul>
 </div>
-</section>
+<p>Besides one-hot encoding and ordinal encoding of categorical features,
+scikit-learn offers the <a class="reference external" href="https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder"><code class="docutils literal notranslate"><span class="pre">TargetEncoder</span></code></a>.
+This encoder is well suited for nominal categorical features with high
+cardinality. This encoding strategy is beyond the scope of this course,
+but the interested reader is encouraged to explore this encoder.</p>
 </section>
 </section>
 
@@ -1028,6 +1037,7 @@ <h3>Analysis<a class="headerlink" href="#id1" title="Link to this heading">#</a>
 <li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#id1">Analysis</a></li>
 </ul>
 </li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#which-encoder-should-i-use">Which encoder should I use?</a></li>
 </ul>
 </nav></div>

searchindex.js (+1 −1)

(Generated file; the diff is not rendered by default.)
