Commit 425071e

Add validation curve for n_estimators plus advice on (not) tuning (#827)
1 parent: 66af6c1

4 files changed (+88, -13 lines)

notebooks/ensemble_ex_03.ipynb (+3, -1)

@@ -85,7 +85,9 @@
     "For both the gradient-boosting and random forest models, create a validation\n",
     "curve using the training set to assess the impact of the number of trees on\n",
     "the performance of each model. Evaluate the list of parameters `param_range =\n",
-    "np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error."
+    "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n",
+    "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n",
+    "right sign of the Mean Absolute Error."
    ]
   },
   {
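
For reference, a minimal self-contained sketch of the call this exercise asks for (it assumes scikit-learn >= 1.3, where `ValidationCurveDisplay` and its `negate_score` option were added; synthetic `make_regression` data and a default `RandomForestRegressor` stand in for the exercise's `data_train`/`target_train` and `forest`, which are defined in cells not shown in this diff):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ValidationCurveDisplay

# Synthetic stand-in for the exercise's training set.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
forest = RandomForestRegressor(random_state=0)

param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
disp = ValidationCurveDisplay.from_estimator(
    forest,
    X,
    y,
    param_name="n_estimators",
    param_range=param_range,
    # scikit-learn scorers follow a "greater is better" convention, so the
    # MAE scorer returns negated values...
    scoring="neg_mean_absolute_error",
    # ...and negate_score=True flips the sign back so the y-axis reads as a
    # plain mean absolute error (lower is better).
    negate_score=True,
    n_jobs=2,
)

The solution files below use exactly this pattern, first with `forest` and then with `gbdt`.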

notebooks/ensemble_sol_03.ipynb (+47, -6)

@@ -91,7 +91,9 @@
     "For both the gradient-boosting and random forest models, create a validation\n",
     "curve using the training set to assess the impact of the number of trees on\n",
     "the performance of each model. Evaluate the list of parameters `param_range =\n",
-    "np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error."
+    "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n",
+    "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n",
+    "right sign of the Mean Absolute Error."
    ]
   },
   {
@@ -105,7 +107,7 @@
     "\n",
     "from sklearn.model_selection import ValidationCurveDisplay\n",
     "\n",
-    "param_range = np.array([1, 2, 5, 10, 20, 50, 100])\n",
+    "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])\n",
     "disp = ValidationCurveDisplay.from_estimator(\n",
     "    forest,\n",
     "    data_train,\n",
@@ -133,6 +135,41 @@
     "ensemble. However, the scores reach a plateau where adding new trees just\n",
     "makes fitting and scoring slower.\n",
     "\n",
+    "Now repeat the analysis for the gradient boosting model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "# solution\n",
+    "disp = ValidationCurveDisplay.from_estimator(\n",
+    "    gbdt,\n",
+    "    data_train,\n",
+    "    target_train,\n",
+    "    param_name=\"n_estimators\",\n",
+    "    param_range=param_range,\n",
+    "    scoring=\"neg_mean_absolute_error\",\n",
+    "    negate_score=True,\n",
+    "    std_display_style=\"errorbar\",\n",
+    "    n_jobs=2,\n",
+    ")\n",
+    "\n",
+    "_ = disp.ax_.set(\n",
+    "    xlabel=\"Number of trees in the gradient boosting model\",\n",
+    "    ylabel=\"Mean absolute error (k$)\",\n",
+    "    title=\"Validation curve for gradient boosting model\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "Gradient boosting models overfit when the number of trees is too large. To\n",
     "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
     "offers an early-stopping option. Internally, the algorithm uses an\n",
@@ -141,9 +178,9 @@
     "improving for several iterations, it stops adding trees.\n",
     "\n",
     "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
-    "of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
-    "such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
-    "deterioration of the overall generalization performance."
+    "of trees is certainly too large as we have seen above. Change the parameter\n",
+    "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n",
+    "5 trees to avoid deterioration of the overall generalization performance."
    ]
   },
   {
@@ -168,7 +205,11 @@
    "source": [
     "We see that the number of trees used is far below 1000 with the current\n",
     "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
-    "have been detrimental."
+    "have been detrimental.\n",
+    "\n",
+    "Please note that one should not hyperparameter tune the number of estimators\n",
+    "for both random forest and gradient boosting models. In this exercise we only\n",
+    "show model performance with varying `n_estimators` for educational purposes."
    ]
   },
   {
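
The solution cell for this early-stopping step sits outside the hunks shown in this diff. A minimal sketch of the behaviour described above, again with synthetic stand-in data (`n_iter_no_change` and the fitted `n_estimators_` attribute are standard `GradientBoostingRegressor` API):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the exercise's training data.
X, y = make_regression(n_samples=2_000, n_features=8, noise=10.0, random_state=0)

# n_estimators=1_000 is deliberately oversized; with n_iter_no_change=5 the
# model holds out validation_fraction (10% by default) of the training data
# and stops adding trees once the validation loss has not improved over the
# last 5 iterations.
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(X, y)
print(gbdt.n_estimators_)  # trees actually fitted, typically far below 1_000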

python_scripts/ensemble_ex_03.py (+3, -1)

@@ -58,7 +58,9 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # Write your code here.

python_scripts/ensemble_sol_03.py (+35, -5)

@@ -58,15 +58,17 @@
 # For both the gradient-boosting and random forest models, create a validation
 # curve using the training set to assess the impact of the number of trees on
 # the performance of each model. Evaluate the list of parameters `param_range =
-# np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error.
+# np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using
+# `neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the
+# right sign of the Mean Absolute Error.

 # %%
 # solution
 import numpy as np

 from sklearn.model_selection import ValidationCurveDisplay

-param_range = np.array([1, 2, 5, 10, 20, 50, 100])
+param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])
 disp = ValidationCurveDisplay.from_estimator(
     forest,
     data_train,
@@ -90,6 +92,30 @@
 # ensemble. However, the scores reach a plateau where adding new trees just
 # makes fitting and scoring slower.
 #
+# Now repeat the analysis for the gradient boosting model.
+
+# %%
+# solution
+disp = ValidationCurveDisplay.from_estimator(
+    gbdt,
+    data_train,
+    target_train,
+    param_name="n_estimators",
+    param_range=param_range,
+    scoring="neg_mean_absolute_error",
+    negate_score=True,
+    std_display_style="errorbar",
+    n_jobs=2,
+)
+
+_ = disp.ax_.set(
+    xlabel="Number of trees in the gradient boosting model",
+    ylabel="Mean absolute error (k$)",
+    title="Validation curve for gradient boosting model",
+)
+
+
+# %% [markdown]
 # Gradient boosting models overfit when the number of trees is too large. To
 # avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
@@ -98,9 +124,9 @@
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change`
-# such that the gradient boosting fitting stops after adding 5 trees to avoid
-# deterioration of the overall generalization performance.
+# of trees is certainly too large as we have seen above. Change the parameter
+# `n_iter_no_change` such that the gradient boosting fitting stops after adding
+# 5 trees to avoid deterioration of the overall generalization performance.

 # %%
 # solution
@@ -113,6 +139,10 @@
 # dataset. Training the gradient boosting model with the entire 1000 trees would
 # have been detrimental.

+# Please note that one should not hyperparameter tune the number of estimators
+# for both random forest and gradient boosting models. In this exercise we only
+# show model performance with varying `n_estimators` for educational purposes.
+
 # %% [markdown]
 # Estimate the generalization performance of this model again using the
 # `sklearn.metrics.mean_absolute_error` metric but this time using the test set
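
The test-set evaluation mentioned in this last context line reduces to a single `mean_absolute_error` call. A self-contained sketch, with a synthetic split standing in for the course's train/test data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course's data_train/data_test split.
X, y = make_regression(n_samples=2_000, n_features=8, noise=10.0, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    X, y, random_state=0
)

gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)
gbdt.fit(data_train, target_train)
error = mean_absolute_error(target_test, gbdt.predict(data_test))
print(f"Test-set MAE: {error:.3f}")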
