|
91 | 91 | "For both the gradient-boosting and random forest models, create a validation\n",
|
92 | 92 | "curve using the training set to assess the impact of the number of trees on\n",
|
93 | 93 | "the performance of each model. Evaluate the list of parameters `param_range =\n",
|
94 |
| - "np.array([1, 2, 5, 10, 20, 50, 100])` and use the mean absolute error." |
| 94 | + "np.array([1, 2, 5, 10, 20, 50, 100, 200])` and score it using\n", |
| 95 | + "`neg_mean_absolute_error`. Remember to set `negate_score=True` to recover the\n", |
| 96 | + "right sign of the Mean Absolute Error." |
95 | 97 | ]
|
96 | 98 | },
|
97 | 99 | {
|
|
105 | 107 | "\n",
|
106 | 108 | "from sklearn.model_selection import ValidationCurveDisplay\n",
|
107 | 109 | "\n",
|
108 |
| - "param_range = np.array([1, 2, 5, 10, 20, 50, 100])\n", |
| 110 | + "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200])\n", |
109 | 111 | "disp = ValidationCurveDisplay.from_estimator(\n",
|
110 | 112 | " forest,\n",
|
111 | 113 | " data_train,\n",
|
|
133 | 135 | "ensemble. However, the scores reach a plateau where adding new trees just\n",
|
134 | 136 | "makes fitting and scoring slower.\n",
|
135 | 137 | "\n",
|
| 138 | + "Now repeat the analysis for the gradient boosting model." |
| 139 | + ] |
| 140 | + }, |
| 141 | + { |
| 142 | + "cell_type": "code", |
| 143 | + "execution_count": null, |
| 144 | + "metadata": { |
| 145 | + "lines_to_next_cell": 2 |
| 146 | + }, |
| 147 | + "outputs": [], |
| 148 | + "source": [ |
| 149 | + "# solution\n", |
| 150 | + "disp = ValidationCurveDisplay.from_estimator(\n", |
| 151 | + " gbdt,\n", |
| 152 | + " data_train,\n", |
| 153 | + " target_train,\n", |
| 154 | + " param_name=\"n_estimators\",\n", |
| 155 | + " param_range=param_range,\n", |
| 156 | + " scoring=\"neg_mean_absolute_error\",\n", |
| 157 | + " negate_score=True,\n", |
| 158 | + " std_display_style=\"errorbar\",\n", |
| 159 | + " n_jobs=2,\n", |
| 160 | + ")\n", |
| 161 | + "\n", |
| 162 | + "_ = disp.ax_.set(\n", |
| 163 | + " xlabel=\"Number of trees in the gradient boosting model\",\n", |
| 164 | + " ylabel=\"Mean absolute error (k$)\",\n", |
| 165 | + " title=\"Validation curve for gradient boosting model\",\n", |
| 166 | + ")" |
| 167 | + ] |
| 168 | + }, |
| 169 | + { |
| 170 | + "cell_type": "markdown", |
| 171 | + "metadata": {}, |
| 172 | + "source": [ |
136 | 173 | "Gradient boosting models overfit when the number of trees is too large. To\n",
|
137 | 174 | "avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
|
138 | 175 | "offers an early-stopping option. Internally, the algorithm uses an\n",
|
|
141 | 178 | "improving for several iterations, it stops adding trees.\n",
|
142 | 179 | "\n",
|
143 | 180 | "Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
|
144 |
| - "of trees is certainly too large. Change the parameter `n_iter_no_change`\n", |
145 |
| - "such that the gradient boosting fitting stops after adding 5 trees to avoid\n", |
146 |
| - "deterioration of the overall generalization performance." |
| 181 | + "of trees is certainly too large as we have seen above. Change the parameter\n", |
| 182 | + "`n_iter_no_change` such that the gradient boosting fitting stops after adding\n", |
| 183 | + "5 trees to avoid deterioration of the overall generalization performance." |
147 | 184 | ]
|
148 | 185 | },
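The notebook's solution cell for this step is collapsed in this view. A minimal sketch of such an early-stopping fit, assuming `data_train` and `target_train` are the training split defined earlier in the notebook:

```python
# Minimal sketch, not necessarily the notebook's exact solution cell.
# Assumes data_train / target_train from earlier in the notebook.
from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(n_estimators=1_000, n_iter_no_change=5)
gbdt.fit(data_train, target_train)

# With n_iter_no_change set, fitting stops once the internal validation
# score has not improved for 5 consecutive iterations; n_estimators_
# reports how many trees were actually kept.
print(gbdt.n_estimators_)
```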
|
149 | 186 | {
|
|
168 | 205 | "source": [
|
169 | 206 | "We see that the number of trees used is far below 1000 with the current\n",
|
170 | 207 | "dataset. Training the gradient boosting model with the entire 1000 trees would\n",
|
171 |
| - "have been detrimental." |
| 208 | + "have been detrimental.\n", |
| 209 | + "\n", |
| 210 | + "Please note that one should not hyperparameter tune the number of estimators\n", |
| 211 | + "for both random forest and gradient boosting models. In this exercise we only\n", |
| 212 | + "show model performance with varying `n_estimators` for educational purposes." |
172 | 213 | ]
|
173 | 214 | },
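To make the "detrimental" claim concrete, one could compare the early-stopped model against a fixed 1000-tree fit on a held-out split. This is a sketch under assumptions: `gbdt` is the early-stopped model fitted above, and `data_test`/`target_test` are a held-out split as used elsewhere in the notebook:

```python
# Sketch under assumptions: gbdt is the fitted early-stopped model,
# data_test / target_test are a held-out split from the same dataset.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

full_gbdt = GradientBoostingRegressor(n_estimators=1_000)
full_gbdt.fit(data_train, target_train)

for name, model in [("early stopped", gbdt), ("full 1000 trees", full_gbdt)]:
    mae = mean_absolute_error(target_test, model.predict(data_test))
    print(f"{name}: test MAE = {mae:.2f} k$")
```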
|
174 | 215 | {
|