<h3>Analysis<a class="headerlink" href="#analysis" title="Link to this heading">#</a></h3>
<p>From an accuracy point of view, the result is almost exactly the same. The
reason is that <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is expressive and robust
enough to deal with the misleading ordering of integer coded categories (which was
not the case for linear models).</p>
<p>However, from a computation point of view, the training time is much longer:
this is caused by the fact that <code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code> generates many more features than
<code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: a column is created for each unique categorical value.</p>
<p>Note that the current implementation of <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> is still
incomplete, and once sparse representations are handled correctly, training
time might improve with such kinds of encodings.</p>
<p>The main take-away message is that arbitrary integer coding of categories is
perfectly fine for <code class="docutils literal notranslate"><span class="pre">HistGradientBoostingClassifier</span></code> and yields fast training
times.</p>
</section>
</section>
<section id="which-encoder-should-i-use">
<h2>Which encoder should I use?<a class="headerlink" href="#which-encoder-should-i-use" title="Link to this heading">#</a></h2>
<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with reasonable depth</p></td>
</tr>
<tr class="row-odd"><td><p>Linear model</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> with caution</p></td>
</tr>
</tbody>
</table>
</div>
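<p>For instance, the linear-model column of the table can be sketched as follows.
This is an illustrative pairing, not code from the course, and the feature
values are invented:</p>

```python
# One-hot encoding feeding a linear model: always meaningful, since no
# spurious order is imposed on the categories.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["blue"], ["red"], ["green"], ["blue"], ["green"]])
y = np.array([1, 0, 1, 0, 0, 1])

linear_model = make_pipeline(
    # sparse output is fine here: LogisticRegression accepts sparse input
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
linear_model.fit(X, y)
```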
<div class="admonition important">
<p class="admonition-title">Important</p>
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">OneHotEncoder</span></code>: always does something meaningful, but can be unnecessarily
slow with trees.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code>: can be detrimental for linear models unless your category
has a meaningful order and you make sure that <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respects this
order. Trees can deal with <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> fine as long as they are deep
enough. However, when you allow the decision tree to grow very deep, it might
overfit on other features.</p></li>
</ul>
</div>
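<p>To make <code class="docutils literal notranslate"><span class="pre">OrdinalEncoder</span></code> respect a meaningful order, the
categories can be passed explicitly. A small sketch (the "size" levels below
are invented for illustration):</p>

```python
# Passing `categories` fixes the integer codes to the stated order instead of
# the default lexicographic one, so the encoding reflects the real ordering.
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
encoder.fit([["M"], ["S"], ["XL"]])
print(encoder.transform([["S"], ["XL"]]))  # S -> 0.0, XL -> 3.0
```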
<p>In addition to one-hot encoding and ordinal encoding of categorical features,
scikit-learn offers the <a class="reference external" href="https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder"><code class="docutils literal notranslate"><span class="pre">TargetEncoder</span></code></a>.
This encoder is well suited for nominal categorical features with high
cardinality. This encoding strategy is beyond the scope of this course,
but the interested reader is encouraged to explore this encoder.</p>
</section>
</section>