Commit 6963515

[ci skip] Add validation curve for n_estimators plus advise on (not) tuning (#827)
1 parent 81dbe33 commit 6963515

9 files changed: +85 −18 lines
Binary file not shown.

_sources/python_scripts/ensemble_ex_03.py (+3 −1)

@@ -58,7 +58,9 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # Write your code here.
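For reference, here is a minimal, self-contained sketch of the call this exercise asks for. The notebook's housing dataset and pre-built estimators are not part of this diff, so a synthetic regression problem and a fresh `RandomForestRegressor` stand in for them; it assumes scikit-learn ≥ 1.3, where `ValidationCurveDisplay` and its `negate_score` option are available.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ValidationCurveDisplay

# Synthetic stand-in for the exercise's training data.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
disp = ValidationCurveDisplay.from_estimator(
    RandomForestRegressor(random_state=0),
    X,
    y,
    param_name="n_estimators",
    param_range=param_range,
    # scikit-learn scorers follow a "greater is better" convention, so the
    # MAE scorer returns negative values ...
    scoring="neg_mean_absolute_error",
    # ... and negate_score=True flips the sign back so the y-axis reads as a
    # plain mean absolute error to minimize.
    negate_score=True,
    n_jobs=2,
)
```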

_sources/python_scripts/ensemble_sol_03.py (+35 −5)

@@ -58,15 +58,17 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # solution
 import numpy as np

 from sklearn.model_selection import ValidationCurveDisplay

-param_range = np.array([1, 2, 5, 10, 20, 50, 100])
+param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
 disp = ValidationCurveDisplay.from_estimator(
     forest,
     data_train,
@@ -90,6 +92,30 @@
 # ensemble. However, the scores reach a plateau where adding new trees just
 # makes fitting and scoring slower.
 #
+# Now repeat the analysis for the gradient boosting model.
+
+# %%
+# solution
+disp = ValidationCurveDisplay.from_estimator(
+    gbdt,
+    data_train,
+    target_train,
+    param_name="n_estimators",
+    param_range=param_range,
+    scoring="neg_mean_absolute_error",
+    negate_score=True,
+    std_display_style="errorbar",
+    n_jobs=2,
+)
+
+_ = disp.ax_.set(
+    xlabel="Number of trees in the gradient boosting model",
+    ylabel="Mean absolute error (k$)",
+    title="Validation curve for gradient boosting model",
+)
+
+
+# %% [markdown]
 # Gradient boosting models overfit when the number of trees is too large. To
 # avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
@@ -98,9 +124,9 @@
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change`
-# such that the gradient boosting fitting stops after adding 5 trees to avoid
-# deterioration of the overall generalization performance.
+# of trees is certainly too large as we have seen above. Change the parameter
+# `n_iter_no_change` such that the gradient boosting fitting stops after adding
+# 5 trees to avoid deterioration of the overall generalization performance.

 # %%
 # solution
@@ -113,6 +139,10 @@
 # dataset. Training the gradient boosting model with the entire 1000 trees would
 # have been detrimental.

+# Please note that one should not hyperparameter tune the number of estimators
+# for both random forest and gradient boosting models. In this exercise we only
+# show model performance with varying `n_estimators` for educational purposes.
+
 # %% [markdown]
 # Estimate the generalization performance of this model again using the
 # `sklearn.metrics.mean_absolute_error` metric but this time using the test set
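The early-stopping solution code itself is elided from the hunk above; what the instructions describe is scikit-learn's built-in early stopping for `GradientBoostingRegressor`. A minimal sketch, again assuming a synthetic dataset in place of the notebook's housing data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# With n_iter_no_change set, the estimator holds out an internal validation
# split (validation_fraction, 10% by default) and stops adding trees once the
# validation score has not improved for 5 consecutive iterations.
gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)
gbdt.fit(X, y)

# The number of trees actually kept is exposed as the fitted attribute
# n_estimators_; the committed run reports 113 on the notebook's data,
# far below the 1_000-tree budget.
print(gbdt.n_estimators_)
```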

appendix/notebook_timings.html (+2 −2)

@@ -971,9 +971,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Link t
 <td><p></p></td>
 </tr>
 <tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/ensemble_sol_03.html"><span class="doc">python_scripts/ensemble_sol_03</span></a></p></td>
-<td><p>2025-04-03 12:41</p></td>
+<td><p>2025-04-11 14:41</p></td>
 <td><p>cache</p></td>
-<td><p>40.34</p></td>
+<td><p>93.91</p></td>
 <td><p></p></td>
 </tr>
 <tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/ensemble_sol_04.html"><span class="doc">python_scripts/ensemble_sol_04</span></a></p></td>

python_scripts/ensemble_ex_03.html (+3 −1)

@@ -743,7 +743,9 @@ <h1>📝 Exercise M6.03<a class="headerlink" href="#exercise-m6-03" title="Link
 </div>
 <p>For both the gradient-boosting and random forest models, create a validation
 curve using the training set to assess the impact of the number of trees on
-the performance of each model. Evaluate the list of parameters <code class="docutils literal notranslate"><span class="pre">param_range</span> <span class="pre">=</span> <span class="pre">np.array([1,</span> <span class="pre">2,</span> <span class="pre">5,</span> <span class="pre">10,</span> <span class="pre">20,</span> <span class="pre">50,</span> <span class="pre">100])</span></code> and use the mean absolute error.</p>
+the performance of each model. Evaluate the list of parameters <code class="docutils literal notranslate"><span class="pre">param_range</span> <span class="pre">=</span> <span class="pre">np.array([1,</span> <span class="pre">2,</span> <span class="pre">5,</span> <span class="pre">10,</span> <span class="pre">20,</span> <span class="pre">50,</span> <span class="pre">100,</span> <span class="pre">200])</span></code> and score it using
+<code class="docutils literal notranslate"><span class="pre">neg_mean_absolute_error</span></code>. Remember to set <code class="docutils literal notranslate"><span class="pre">negate_score=True</span></code> to recover the
+right sign of the Mean Absolute Error.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># Write your code here.</span>

python_scripts/ensemble_sol_03.html (+41 −8)

@@ -749,15 +749,17 @@ <h1>📃 Solution for Exercise M6.03<a class="headerlink" href="#solution-for-ex
 </div>
 <p>For both the gradient-boosting and random forest models, create a validation
 curve using the training set to assess the impact of the number of trees on
-the performance of each model. Evaluate the list of parameters <code class="docutils literal notranslate"><span class="pre">param_range</span> <span class="pre">=</span> <span class="pre">np.array([1,</span> <span class="pre">2,</span> <span class="pre">5,</span> <span class="pre">10,</span> <span class="pre">20,</span> <span class="pre">50,</span> <span class="pre">100])</span></code> and use the mean absolute error.</p>
+the performance of each model. Evaluate the list of parameters <code class="docutils literal notranslate"><span class="pre">param_range</span> <span class="pre">=</span> <span class="pre">np.array([1,</span> <span class="pre">2,</span> <span class="pre">5,</span> <span class="pre">10,</span> <span class="pre">20,</span> <span class="pre">50,</span> <span class="pre">100,</span> <span class="pre">200])</span></code> and score it using
+<code class="docutils literal notranslate"><span class="pre">neg_mean_absolute_error</span></code>. Remember to set <code class="docutils literal notranslate"><span class="pre">negate_score=True</span></code> to recover the
+right sign of the Mean Absolute Error.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># solution</span>
 <span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>

 <span class="kn">from</span><span class="w"> </span><span class="nn">sklearn.model_selection</span><span class="w"> </span><span class="kn">import</span> <span class="n">ValidationCurveDisplay</span>

-<span class="n">param_range</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">100</span><span class="p">])</span>
+<span class="n">param_range</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">200</span><span class="p">])</span>
 <span class="n">disp</span> <span class="o">=</span> <span class="n">ValidationCurveDisplay</span><span class="o">.</span><span class="n">from_estimator</span><span class="p">(</span>
 <span class="n">forest</span><span class="p">,</span>
 <span class="n">data_train</span><span class="p">,</span>
@@ -779,22 +781,50 @@ <h1>📃 Solution for Exercise M6.03<a class="headerlink" href="#solution-for-ex
 </div>
 </div>
 <div class="cell_output docutils container">
-<img alt="../_images/d40f1bba7734b7b618e54b5a26e5fcc6ae5fdc171e901bc5dcb1d7b2fec8ca80.png" src="../_images/d40f1bba7734b7b618e54b5a26e5fcc6ae5fdc171e901bc5dcb1d7b2fec8ca80.png" />
+<img alt="../_images/c0d439d9824518c38aa0fbfc44afc4d92108a2665a643e98ce594f85b167ec70.png" src="../_images/c0d439d9824518c38aa0fbfc44afc4d92108a2665a643e98ce594f85b167ec70.png" />
 </div>
 </div>
 <p>Random forest models improve when increasing the number of trees in the
 ensemble. However, the scores reach a plateau where adding new trees just
 makes fitting and scoring slower.</p>
+<p>Now repeat the analysis for the gradient boosting model.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># solution</span>
+<span class="n">disp</span> <span class="o">=</span> <span class="n">ValidationCurveDisplay</span><span class="o">.</span><span class="n">from_estimator</span><span class="p">(</span>
+<span class="n">gbdt</span><span class="p">,</span>
+<span class="n">data_train</span><span class="p">,</span>
+<span class="n">target_train</span><span class="p">,</span>
+<span class="n">param_name</span><span class="o">=</span><span class="s2">&quot;n_estimators&quot;</span><span class="p">,</span>
+<span class="n">param_range</span><span class="o">=</span><span class="n">param_range</span><span class="p">,</span>
+<span class="n">scoring</span><span class="o">=</span><span class="s2">&quot;neg_mean_absolute_error&quot;</span><span class="p">,</span>
+<span class="n">negate_score</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
+<span class="n">std_display_style</span><span class="o">=</span><span class="s2">&quot;errorbar&quot;</span><span class="p">,</span>
+<span class="n">n_jobs</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
+<span class="p">)</span>
+
+<span class="n">_</span> <span class="o">=</span> <span class="n">disp</span><span class="o">.</span><span class="n">ax_</span><span class="o">.</span><span class="n">set</span><span class="p">(</span>
+<span class="n">xlabel</span><span class="o">=</span><span class="s2">&quot;Number of trees in the gradient boosting model&quot;</span><span class="p">,</span>
+<span class="n">ylabel</span><span class="o">=</span><span class="s2">&quot;Mean absolute error (k$)&quot;</span><span class="p">,</span>
+<span class="n">title</span><span class="o">=</span><span class="s2">&quot;Validation curve for gradient boosting model&quot;</span><span class="p">,</span>
+<span class="p">)</span>
+</pre></div>
+</div>
+</div>
+<div class="cell_output docutils container">
+<img alt="../_images/f19dd2a1686cf4790940f70d0d44c0799a7dd4abb822136b914c2c87dddedd71.png" src="../_images/f19dd2a1686cf4790940f70d0d44c0799a7dd4abb822136b914c2c87dddedd71.png" />
+</div>
+</div>
 <p>Gradient boosting models overfit when the number of trees is too large. To
 avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 offers an early-stopping option. Internally, the algorithm uses an
 out-of-sample set to compute the generalization performance of the model at
 each addition of a tree. Thus, if the generalization performance is not
 improving for several iterations, it stops adding trees.</p>
 <p>Now, create a gradient-boosting model with <code class="docutils literal notranslate"><span class="pre">n_estimators=1_000</span></code>. This number
-of trees is certainly too large. Change the parameter <code class="docutils literal notranslate"><span class="pre">n_iter_no_change</span></code>
-such that the gradient boosting fitting stops after adding 5 trees to avoid
-deterioration of the overall generalization performance.</p>
+of trees is certainly too large as we have seen above. Change the parameter
+<code class="docutils literal notranslate"><span class="pre">n_iter_no_change</span></code> such that the gradient boosting fitting stops after adding
+5 trees to avoid deterioration of the overall generalization performance.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
 <div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="c1"># solution</span>
@@ -805,14 +835,17 @@ <h1>📃 Solution for Exercise M6.03<a class="headerlink" href="#solution-for-ex
 </div>
 </div>
 <div class="cell_output docutils container">
-<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>211
+<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>113
 </pre></div>
 </div>
 </div>
 </div>
 <p>We see that the number of trees used is far below 1000 with the current
 dataset. Training the gradient boosting model with the entire 1000 trees would
 have been detrimental.</p>
+<p>Please note that one should not hyperparameter tune the number of estimators
+for both random forest and gradient boosting models. In this exercise we only
+show model performance with varying <code class="docutils literal notranslate"><span class="pre">n_estimators</span></code> for educational purposes.</p>
 <p>Estimate the generalization performance of this model again using the
 <code class="docutils literal notranslate"><span class="pre">sklearn.metrics.mean_absolute_error</span></code> metric but this time using the test set
 that we held out at the beginning of the notebook. Compare the resulting value
@@ -828,7 +861,7 @@ <h1>📃 Solution for Exercise M6.03<a class="headerlink" href="#solution-for-ex
 </div>
 </div>
 <div class="cell_output docutils container">
-<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>On average, our GBDT regressor makes an error of 34.93 k$
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>On average, our GBDT regressor makes an error of 36.93 k$
 </pre></div>
 </div>
 </div>
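To make the final evaluation step concrete, here is a hedged, self-contained sketch of computing the test-set MAE the way the solution describes. It uses synthetic data and a fresh train/test split, so the printed number will not match the 36.93 k$ reported above (the k$ unit belongs to the notebook's housing target).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's housing data and held-out test set.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)
gbdt.fit(X_train, y_train)

# Generalization performance on data never seen during fitting/early stopping.
error = mean_absolute_error(y_test, gbdt.predict(X_test))
print(f"Test-set mean absolute error: {error:.2f}")
```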

searchindex.js (+1 −1)
Some generated files are not rendered by default.
