Commit 425071e

Add validation curve for n_estimators plus advice on (not) tuning (#827)
1 parent: 66af6c1

4 files changed (+88, -13 lines)

notebooks/ensemble_ex_03.ipynb (+3, -1)

@@ -85,7 +85,9 @@
     "For both the gradient-boosting and random forest models, create a validation\n",
     "curve using the training set to assess the impact of the number of trees on\n",
     "the performance of each model. Evaluate the list of parameters `param_range =\n",
-    "np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error."
+    "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n",
+    "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n",
+    "right sign of the Mean Absolute Error."
    ]
   },
   {
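
For reference, a minimal self-contained sketch of the call this exercise asks for (it assumes scikit-learn >= 1.3, where `ValidationCurveDisplay` and its `negate_score` option were added; synthetic `make_regression` data and a default `RandomForestRegressor` stand in for the exercise's `data_train`/`target_train` and `forest`, which are defined in cells not shown in this diff):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ValidationCurveDisplay

# Synthetic stand-in for the exercise's training set.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
forest = RandomForestRegressor(random_state=0)

param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
disp = ValidationCurveDisplay.from_estimator(
    forest,
    X,
    y,
    param_name="n_estimators",
    param_range=param_range,
    # scikit-learn scorers follow a "greater is better" convention, so the
    # MAE scorer returns negated values...
    scoring="neg_mean_absolute_error",
    # ...and negate_score=True flips the sign back so the y-axis reads as a
    # plain mean absolute error (lower is better).
    negate_score=True,
    n_jobs=2,
)

The solution files below use exactly this pattern, first with `forest` and then with `gbdt`.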

notebooks/ensemble_sol_03.ipynb (+47, -6)

@@ -91,7 +91,9 @@
     "For both the gradient-boosting and random forest models, create a validation\n",
     "curve using the training set to assess the impact of the number of trees on\n",
     "the performance of each model. Evaluate the list of parameters `param_range =\n",
-    "np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error."
+    "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n",
+    "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n",
+    "right sign of the Mean Absolute Error."
    ]
   },
   {
@@ -105,7 +107,7 @@
     "\n",
     "from sklearn.model_selection import ValidationCurveDisplay\n",
     "\n",
-    "param_range = np.array([1, 2, 5, 10, 20, 50, 100])\n",
+    "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])\n",
     "disp = ValidationCurveDisplay.from_estimator(\n",
     "    forest,\n",
     "    data_train,\n",
@@ -133,6 +135,41 @@
     "ensemble. However, the scores reach a plateau where adding new trees just\n",
     "makes fitting and scoring slower.\n",
     "\n",
+    "Now repeat the analysis for the gradient boosting model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "# solution\n",
+    "disp = ValidationCurveDisplay.from_estimator(\n",
+    "    gbdt,\n",
+    "    data_train,\n",
+    "    target_train,\n",
+    "    param_name=\"n_estimators\",\n",
+    "    param_range=param_range,\n",
+    "    scoring=\"neg_mean_absolute_error\",\n",
+    "    negate_score=True,\n",
+    "    std_display_style=\"errorbar\",\n",
+    "    n_jobs=2,\n",
+    ")\n",
+    "\n",
+    "_ = disp.ax_.set(\n",
+    "    xlabel=\"Number of trees in the gradient boosting model\",\n",
+    "    ylabel=\"Mean absolute error (k$)\",\n",
+    "    title=\"Validation curve for gradient boosting model\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "Gradient boosting models overfit when the number of trees is too large. To\n",
     "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
@@ -141,9 +178,9 @@
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
-    "such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
-    "deterioration of the overall generalization performance."
+    "of trees is certainly too large as we have seen above. Change the parameter\n",
+    "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n",
+    "5 trees to avoid deterioration of the overall generalization performance."
    ]
   },
   {
@@ -168,7 +205,11 @@
    "source": [
     "We see that the number of trees used is far below 1000 with the current\n",
     "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been detrimental."
+    "have been detrimental.\n",
+    "\n",
+    "Please note that one should not hyperparameter tune the number of estimators\n",
+    "for both random forest and gradient boosting models. In this exercise we only\n",
+    "show model performance with varying `n_estimators` for educational purposes."
    ]
   },
   {
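
The solution cell for this early-stopping step sits outside the hunks shown in this diff. A minimal sketch of the behaviour described above, again with synthetic stand-in data (`n_iter_no_change` and the fitted `n_estimators_` attribute are standard `GradientBoostingRegressor` API):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the exercise's training data.
X, y = make_regression(n_samples=2_000, n_features=8, noise=10.0, random_state=0)

# n_estimators=1_000 is deliberately oversized; with n_iter_no_change=5 the
# model holds out validation_fraction (10% by default) of the training data
# and stops adding trees once the validation loss has not improved over the
# last 5 iterations.
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(X, y)
print(gbdt.n_estimators_)  # trees actually fitted, typically far below 1_000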

python_scripts/ensemble_ex_03.py (+3, -1)

@@ -58,7 +58,9 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # Write your code here.

python_scripts/ensemble_sol_03.py (+35, -5)

@@ -58,15 +58,17 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # solution
 import numpy as np

 from sklearn.model_selection import ValidationCurveDisplay

-param_range = np.array([1, 2, 5, 10, 20, 50, 100])
+param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
 disp = ValidationCurveDisplay.from_estimator(
     forest,
     data_train,
@@ -90,6 +92,30 @@
 # ensemble. However, the scores reach a plateau where adding new trees just
 # makes fitting and scoring slower.
 #
+# Now repeat the analysis for the gradient boosting model.
+
+# %%
+# solution
+disp = ValidationCurveDisplay.from_estimator(
+    gbdt,
+    data_train,
+    target_train,
+    param_name="n_estimators",
+    param_range=param_range,
+    scoring="neg_mean_absolute_error",
+    negate_score=True,
+    std_display_style="errorbar",
+    n_jobs=2,
+)
+
+_ = disp.ax_.set(
+    xlabel="Number of trees in the gradient boosting model",
+    ylabel="Mean absolute error (k$)",
+    title="Validation curve for gradient boosting model",
+)
+
+
+# %% [markdown]
 # Gradient boosting models overfit when the number of trees is too large. To
 # avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
@@ -98,9 +124,9 @@
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change`
-# such that the gradient boosting fitting stops after adding 5 trees to avoid
-# deterioration of the overall generalization performance.
+# of trees is certainly too large as we have seen above. Change the parameter
+# `n_iter_no_change` such that the gradient boosting fitting stops after adding
+# 5 trees to avoid deterioration of the overall generalization performance.

 # %%
 # solution
@@ -113,6 +139,10 @@
 # dataset. Training the gradient boosting model with the entire 1000 trees would
 # have been detrimental.

+# Please note that one should not hyperparameter tune the number of estimators
+# for both random forest and gradient boosting models. In this exercise we only
+# show model performance with varying `n_estimators` for educational purposes.
+
 # %% [markdown]
 # Estimate the generalization performance of this model again using the
 # `sklearn.metrics.mean_absolute_error` metric but this time using the test set
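
The test-set evaluation mentioned in this last context line reduces to a single `mean_absolute_error` call. A self-contained sketch, with a synthetic split standing in for the course's train/test data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course's data_train/data_test split.
X, y = make_regression(n_samples=2_000, n_features=8, noise=10.0, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    X, y, random_state=0
)

gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)
gbdt.fit(data_train, target_train)
error = mean_absolute_error(target_test, gbdt.predict(data_test))
print(f"Test-set MAE: {error:.3f}")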
