@@ -156,58 +156,41 @@ def __init__(
'feature_preprocessor': ["no_preprocessing"]
}

- resampling_strategy : Union[ str, BaseCrossValidator, _RepeatedSplits, BaseShuffleSplit] = "holdout"
+ resampling_strategy : str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
How to handle overfitting; ``resampling_strategy_arguments`` may be required
if using a ``"cv"`` based method or a Splitter object.

+ * **Options**
+     * ``"holdout"`` - Use a 67:33 (train:test) split
+     * ``"cv"`` - Perform cross validation; requires ``"folds"`` in ``resampling_strategy_arguments``
+     * ``"holdout-iterative-fit"`` - Same as ``"holdout"`` but iterative fit where possible
+     * ``"cv-iterative-fit"`` - Same as ``"cv"`` but iterative fit where possible
+     * ``"partial-cv"`` - Same as ``"cv"`` but uses intensification
+     * ``BaseCrossValidator`` - Any ``BaseCrossValidator`` subclass (found in scikit-learn ``model_selection`` module)
+     * ``_RepeatedSplits`` - Any ``_RepeatedSplits`` subclass (found in scikit-learn ``model_selection`` module)
+     * ``BaseShuffleSplit`` - Any ``BaseShuffleSplit`` subclass (found in scikit-learn ``model_selection`` module)
+
If using a Splitter object that relies on the dataset retaining its current
size and order, you will need to look at the ``dataset_compression`` argument
and ensure that ``"subsample"`` is not included in the applied compression
``"methods"`` or disable it entirely with ``False``.

- **Options**
-
- * ``"holdout"``:
- 67:33 (train:test) split
- * ``"holdout-iterative-fit"``:
- 67:33 (train:test) split, iterative fit where possible
- * ``"cv"``:
- crossvalidation,
- requires ``"folds"`` in ``resampling_strategy_arguments``
- * ``"cv-iterative-fit"``:
- crossvalidation,
- calls iterative fit where possible,
- requires ``"folds"`` in ``resampling_strategy_arguments``
- * 'partial-cv':
- crossvalidation with intensification,
- requires ``"folds"`` in ``resampling_strategy_arguments``
- * ``BaseCrossValidator`` subclass:
- any BaseCrossValidator subclass (found in scikit-learn model_selection module)
- * ``_RepeatedSplits`` subclass:
- any _RepeatedSplits subclass (found in scikit-learn model_selection module)
- * ``BaseShuffleSplit`` subclass:
- any BaseShuffleSplit subclass (found in scikit-learn model_selection module)
-
- resampling_strategy_arguments : dict, optional if 'holdout' (train_size default=0.67)
- Additional arguments for resampling_strategy:
-
- * ``train_size`` should be between 0.0 and 1.0 and represent the
- proportion of the dataset to include in the train split.
- * ``shuffle`` determines whether the data is shuffled prior to
- splitting it into train and validation.
-
- Available arguments:
-
- * 'holdout': {'train_size': float}
- * 'holdout-iterative-fit': {'train_size': float}
- * 'cv': {'folds': int}
- * 'cv-iterative-fit': {'folds': int}
- * 'partial-cv': {'folds': int, 'shuffle': bool}
- * BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments
- required by chosen class as specified in scikit-learn documentation.
- If arguments are not provided, scikit-learn defaults are used.
- If no defaults are available, an exception is raised.
- Refer to the 'n_splits' argument as 'folds'.
+ resampling_strategy_arguments : Optional[Dict] = None
+ Additional arguments for ``resampling_strategy``; these are required if
+ using a ``cv`` based strategy. The default arguments if left as ``None``
+ are:
+
+ .. code-block:: python
+
+ {
+     "train_size": 0.67,  # The size of the training set
+     "shuffle": True,     # Whether to shuffle before splitting data
+     "folds": 5           # Used in 'cv' based resampling strategies
+ }
+
+ If using a custom splitter class which takes ``n_splits``, such as
+ `KFold <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold>`_,
+ the value of ``"folds"`` will be used.

tmp_folder : string, optional (None)
folder to store configuration output and log files, if ``None``
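
For illustration, a minimal sketch of how the strategies documented above
might be passed to the estimator (this assumes the public
``AutoSklearnClassifier`` constructor; the example itself is not part of the
patch):

.. code-block:: python

    from autosklearn.classification import AutoSklearnClassifier

    # Hold-out with an explicit train size; without arguments the defaults
    # above apply (train_size=0.67, shuffle=True)
    automl = AutoSklearnClassifier(
        resampling_strategy="holdout",
        resampling_strategy_arguments={"train_size": 0.8},
    )

    # "cv" based strategies read "folds" from resampling_strategy_arguments
    automl_cv = AutoSklearnClassifier(
        resampling_strategy="cv",
        resampling_strategy_arguments={"folds": 5},
    )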
@@ -219,12 +202,12 @@ def __init__(

n_jobs : int, optional, experimental
The number of jobs to run in parallel for ``fit()``. ``-1`` means
- using all processors.
-
- **Important notes**:
-
- * By default, Auto-sklearn uses one core.
- * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
+ using all processors.
+
+ **Important notes**:
+
+ * By default, Auto-sklearn uses one core.
+ * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
of models in the ensemble.
* ``predict()`` is not affected by ``n_jobs`` (in contrast to most scikit-learn models)
* If ``dask_client`` is ``None``, a new dask client is created.
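
A short sketch of the ``n_jobs`` notes above (again assuming the public
constructor; the dask client behaviour is handled internally as described):

.. code-block:: python

    from autosklearn.classification import AutoSklearnClassifier

    # Run fit() with 4 parallel jobs; -1 would use all processors.
    # predict() and ensemble building are unaffected by this setting.
    automl = AutoSklearnClassifier(n_jobs=4)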
@@ -288,16 +271,14 @@ def __init__(

dataset_compression: Union[bool, Mapping[str, Any]] = True
We compress datasets so that they fit into some predefined amount of memory.
- Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
+ Currently this does not apply to dataframes or sparse arrays, only to raw
+ numpy arrays.

- **NOTE**
-
- If using a custom ``resampling_strategy`` that relies on specific
+ **NOTE** - If using a custom ``resampling_strategy`` that relies on specific
size or ordering of data, this must be disabled to preserve these properties.

- You can disable this entirely by passing ``False``.
-
- Default configuration when left as ``True``:
+ You can disable this entirely by passing ``False`` or leave it as the default
+ ``True`` to use the configuration below.

.. code-block:: python
@@ -311,36 +292,36 @@ def __init__(

The available options are described here:

- **memory_allocation**
-
- By default, we attempt to fit the dataset into `` 0.1 * memory_limit ``. This
- float value can be set with ``"memory_allocation": 0.1``. We also allow for
- specifying absolute memory in MB, e.g. 10MB is ``"memory_allocation": 10``.
-
- The memory used by the dataset is checked after each reduction method is
- performed. If the dataset fits into the allocated memory, any further methods
- listed in ``"methods"`` will not be performed.
-
- For example, if ``methods: ["precision", "subsample"]`` and the
- ``"precision"`` reduction step was enough to make the dataset fit into memory,
- then the ``"subsample"`` reduction step will not be performed.
-
- **methods**
-
- We currently provide the following methods for reducing the dataset size .
- These can be provided in a list and are performed in the order as given.
-
- * ``"precision"`` - We reduce floating point precision as follows:
- * ``np.float128 -> np.float64``
- * ``np.float96 -> np.float64 ``
- * ``np.float64 -> np.float32``
-
- * ``subsample`` - We subsample data such that it **fits directly into the
- memory allocation** ``memory_allocation * memory_limit``. Therefore, this
- should likely be the last method listed in ``"methods"``.
- Subsampling takes into account classification labels and stratifies
- accordingly. We guarantee that at least one occurrence of each label is
- included in the sampled set.
+ * **memory_allocation**
+ By default, we attempt to fit the dataset into ``0.1 * memory_limit``.
+ This float value can be set with ``"memory_allocation": 0.1``.
+ We also allow for specifying absolute memory in MB, e.g. 10MB is
+ ``"memory_allocation": 10``.
+
+ The memory used by the dataset is checked after each reduction method is
+ performed. If the dataset fits into the allocated memory, any further
+ methods listed in ``"methods"`` will not be performed.
+
+ For example, if ``methods: ["precision", "subsample"]`` and the
+ ``"precision"`` reduction step was enough to make the dataset fit into
+ memory, then the ``"subsample"`` reduction step will not be performed.
+
+ * **methods**
+ We provide the following methods for reducing the dataset size.
+ These can be provided in a list and are performed in the order given.
+
+ * ``"precision"`` - We reduce floating point precision as follows:
+ * ``np.float128 -> np.float64``
+ * ``np.float96 -> np.float64``
+ * ``np.float64 -> np.float32``
+
+ * ``subsample`` - We subsample data such that it **fits directly into
+ the memory allocation** ``memory_allocation * memory_limit``.
+ Therefore, this should likely be the last method listed in
+ ``"methods"``.
+ Subsampling takes into account classification labels and stratifies
+ accordingly. We guarantee that at least one occurrence of each
+ label is included in the sampled set.
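
Tying the two options together, a hedged sketch of a custom
``dataset_compression`` mapping (assuming the constructor accepts the keys
exactly as documented above):

.. code-block:: python

    from autosklearn.classification import AutoSklearnClassifier

    # Allocate 20% of memory_limit to the dataset and only reduce precision;
    # omitting "subsample" preserves dataset size and order, as required by
    # order-sensitive splitters
    automl = AutoSklearnClassifier(
        dataset_compression={
            "memory_allocation": 0.2,
            "methods": ["precision"],
        },
    )

    # Or disable compression entirely
    automl_off = AutoSklearnClassifier(dataset_compression=False)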

Attributes
----------