   <li><a href="../../index.html">Start</a></li>
   <li><a href="../../releases.html">Releases</a></li>
   <li><a href="../../installation.html">Installation</a></li>
-  <li><a href="../../manual.html">Manual</a></li>
   <li><a href="../../examples/index.html">Examples</a></li>
   <li><a href="../../api.html">API</a></li>
   <li><a href="../../extending.html">Extending</a></li>
@@ -269,39 +268,58 @@ <h1>Source code for autosklearn.estimators</h1><div class="highlight"><pre>
 <span class="sd">            'feature_preprocessor': ["no_preprocessing"]</span>
 <span class="sd">        }</span>

-<span class="sd">    resampling_strategy : str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"</span>
+<span class="sd">    resampling_strategy : Union[str, BaseCrossValidator, _RepeatedSplits, BaseShuffleSplit] = "holdout"</span>
 <span class="sd">        How to handle overfitting; you might need to use ``resampling_strategy_arguments``</span>
 <span class="sd">        if using a ``"cv"``-based method or a Splitter object.</span>

-<span class="sd">        * **Options**</span>
-<span class="sd">            * ``"holdout"`` - Use a 67:33 (train:test) split</span>
-<span class="sd">            * ``"cv"``: perform cross validation, requires "folds" in ``resampling_strategy_arguments``</span>
-<span class="sd">            * ``"holdout-iterative-fit"`` - Same as "holdout" but iterative fit where possible</span>
-<span class="sd">            * ``"cv-iterative-fit"``: Same as "cv" but iterative fit where possible</span>
-<span class="sd">            * ``"partial-cv"``: Same as "cv" but uses intensification.</span>
-<span class="sd">            * ``BaseCrossValidator`` - any BaseCrossValidator subclass (found in scikit-learn model_selection module)</span>
-<span class="sd">            * ``_RepeatedSplits`` - any _RepeatedSplits subclass (found in scikit-learn model_selection module)</span>
-<span class="sd">            * ``BaseShuffleSplit`` - any BaseShuffleSplit subclass (found in scikit-learn model_selection module)</span>
-
 <span class="sd">        If using a Splitter object that relies on the dataset retaining its current</span>
 <span class="sd">        size and order, you will need to look at the ``dataset_compression`` argument</span>
 <span class="sd">        and ensure that ``"subsample"`` is not included in the applied compression</span>
 <span class="sd">        ``"methods"`` or disable it entirely with ``False``.</span>

-<span class="sd">    resampling_strategy_arguments : Optional[Dict]</span>
-<span class="sd">        Additional arguments for ``resampling_strategy``, this is required if</span>
-<span class="sd">        using a ``cv`` based strategy:</span>
-
-<span class="sd">        .. code-block:: python</span>
-
-<span class="sd">            {</span>
-<span class="sd">                "train_size": 0.67,  # The size of the training set</span>
-<span class="sd">                "shuffle": True,     # Whether to shuffle before splitting data</span>
-<span class="sd">                "folds": 5           # Used in 'cv' based resampling strategies</span>
-<span class="sd">            }</span>
-
-<span class="sd">        If using a custom splitter class, which takes ``n_splits`` such as</span>
-<span class="sd">        `PredefinedSplit <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold>`_, the value of ``"folds"`` will be used.</span>
+<span class="sd">        **Options**</span>
+
+<span class="sd">        * ``"holdout"``:</span>
+<span class="sd">            67:33 (train:test) split</span>
+<span class="sd">        * ``"holdout-iterative-fit"``:</span>
+<span class="sd">            67:33 (train:test) split, iterative fit where possible</span>
+<span class="sd">        * ``"cv"``:</span>
+<span class="sd">            cross-validation,</span>
+<span class="sd">            requires ``"folds"`` in ``resampling_strategy_arguments``</span>
+<span class="sd">        * ``"cv-iterative-fit"``:</span>
+<span class="sd">            cross-validation,</span>
+<span class="sd">            calls iterative fit where possible,</span>
+<span class="sd">            requires ``"folds"`` in ``resampling_strategy_arguments``</span>
+<span class="sd">        * ``"partial-cv"``:</span>
+<span class="sd">            cross-validation with intensification,</span>
+<span class="sd">            requires ``"folds"`` in ``resampling_strategy_arguments``</span>
+<span class="sd">        * ``BaseCrossValidator`` subclass:</span>
+<span class="sd">            any BaseCrossValidator subclass (found in scikit-learn model_selection module)</span>
+<span class="sd">        * ``_RepeatedSplits`` subclass:</span>
+<span class="sd">            any _RepeatedSplits subclass (found in scikit-learn model_selection module)</span>
+<span class="sd">        * ``BaseShuffleSplit`` subclass:</span>
+<span class="sd">            any BaseShuffleSplit subclass (found in scikit-learn model_selection module)</span>
+
+<span class="sd">    resampling_strategy_arguments : dict, optional if 'holdout' (train_size default=0.67)</span>
+<span class="sd">        Additional arguments for ``resampling_strategy``:</span>
+
+<span class="sd">        * ``train_size`` should be between 0.0 and 1.0 and represent the</span>
+<span class="sd">          proportion of the dataset to include in the train split.</span>
+<span class="sd">        * ``shuffle`` determines whether the data is shuffled prior to</span>
+<span class="sd">          splitting it into train and validation.</span>
+
+<span class="sd">        Available arguments:</span>
+
+<span class="sd">        * 'holdout': {'train_size': float}</span>
+<span class="sd">        * 'holdout-iterative-fit': {'train_size': float}</span>
+<span class="sd">        * 'cv': {'folds': int}</span>
+<span class="sd">        * 'cv-iterative-fit': {'folds': int}</span>
+<span class="sd">        * 'partial-cv': {'folds': int, 'shuffle': bool}</span>
+<span class="sd">        * BaseCrossValidator, _RepeatedSplits, or BaseShuffleSplit object: all arguments</span>
+<span class="sd">          required by the chosen class, as specified in the scikit-learn documentation.</span>
+<span class="sd">          If arguments are not provided, scikit-learn defaults are used.</span>
+<span class="sd">          If no defaults are available, an exception is raised.</span>
+<span class="sd">          Refer to the 'n_splits' argument as 'folds'.</span>
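The docstring hunk above states which ``resampling_strategy_arguments`` keys each string strategy needs: holdout strategies default ``train_size`` to 0.67, while ``"cv"``-based strategies require ``"folds"``. As a hedged illustration only (the helper name and structure are hypothetical, not auto-sklearn API), those rules can be sketched in plain Python:

```python
# Illustrative sketch of the documented argument rules; not auto-sklearn code.
HOLDOUT_STRATEGIES = {"holdout", "holdout-iterative-fit"}
CV_STRATEGIES = {"cv", "cv-iterative-fit", "partial-cv"}

def resolve_resampling_arguments(strategy, arguments=None):
    """Fill in documented defaults and check required keys for a strategy."""
    arguments = dict(arguments or {})
    if strategy in HOLDOUT_STRATEGIES:
        # Documented default: a 67:33 (train:test) split.
        arguments.setdefault("train_size", 0.67)
    elif strategy in CV_STRATEGIES:
        # "cv"-based strategies require "folds" (scikit-learn's n_splits).
        if "folds" not in arguments:
            raise ValueError(
                f'"{strategy}" requires "folds" in resampling_strategy_arguments'
            )
    else:
        raise ValueError(f"unknown resampling strategy: {strategy!r}")
    return arguments
```

Splitter objects (``BaseCrossValidator`` etc.) are deliberately left out of this sketch; per the docstring, they take whatever arguments their scikit-learn class requires.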

 <span class="sd">    tmp_folder : string, optional (None)</span>
 <span class="sd">        folder to store configuration output and log files, if ``None``</span>
@@ -313,12 +331,12 @@ <h1>Source code for autosklearn.estimators</h1><div class="highlight"><pre>

 <span class="sd">    n_jobs : int, optional, experimental</span>
 <span class="sd">        The number of jobs to run in parallel for ``fit()``. ``-1`` means</span>
-<span class="sd">        using all processors.</span>
-
-<span class="sd">        **Important notes**:</span>
-
-<span class="sd">        * By default, Auto-sklearn uses one core.</span>
-<span class="sd">        * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number</span>
+<span class="sd">        using all processors. </span>
+<span class="sd"> </span>
+<span class="sd">        **Important notes**: </span>
+<span class="sd"> </span>
+<span class="sd">        * By default, Auto-sklearn uses one core. </span>
+<span class="sd">        * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number </span>
 <span class="sd">          of models in the ensemble.</span>
 <span class="sd">        * ``predict()`` is not affected by ``n_jobs`` (in contrast to most scikit-learn models)</span>
 <span class="sd">        * If ``dask_client`` is ``None``, a new dask client is created.</span>
@@ -382,14 +400,16 @@ <h1>Source code for autosklearn.estimators</h1><div class="highlight"><pre>

 <span class="sd">    dataset_compression: Union[bool, Mapping[str, Any]] = True</span>
 <span class="sd">        We compress datasets so that they fit into some predefined amount of memory.</span>
-<span class="sd">        Currently this does not apply to dataframes or sparse arrays, only to raw</span>
-<span class="sd">        numpy arrays.</span>
+<span class="sd">        Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.</span>

-<span class="sd">        **NOTE** - If using a custom ``resampling_strategy`` that relies on specific</span>
+<span class="sd">        **NOTE**</span>
+
+<span class="sd">        If using a custom ``resampling_strategy`` that relies on specific</span>
 <span class="sd">        size or ordering of data, this must be disabled to preserve these properties.</span>

-<span class="sd">        You can disable this entirely by passing ``False`` or leave as the default</span>
-<span class="sd">        ``True`` for configuration below.</span>
+<span class="sd">        You can disable this entirely by passing ``False``.</span>
+
+<span class="sd">        Default configuration when left as ``True``:</span>

 <span class="sd">        .. code-block:: python</span>

@@ -403,36 +423,36 @@ <h1>Source code for autosklearn.estimators</h1><div class="highlight"><pre>

 <span class="sd">        The available options are described here:</span>

-<span class="sd">        * **memory_allocation**</span>
-<span class="sd">            By default, we attempt to fit the dataset into ``0.1 * memory_limit``.</span>
-<span class="sd">            This float value can be set with ``"memory_allocation": 0.1``.</span>
-<span class="sd">            We also allow for specifying absolute memory in MB, e.g. 10MB is</span>
-<span class="sd">            ``"memory_allocation": 10``.</span>
-
-<span class="sd">            The memory used by the dataset is checked after each reduction method is</span>
-<span class="sd">            performed. If the dataset fits into the allocated memory, any further</span>
-<span class="sd">            methods listed in ``"methods"`` will not be performed.</span>
-
-<span class="sd">            For example, if ``methods: ["precision", "subsample"]`` and the</span>
-<span class="sd">            ``"precision"`` reduction step was enough to make the dataset fit into</span>
-<span class="sd">            memory, then the ``"subsample"`` reduction step will not be performed.</span>
-
-<span class="sd">        * **methods**</span>
-<span class="sd">            We provide the following methods for reducing the dataset size.</span>
-<span class="sd">            These can be provided in a list and are performed in the order as given.</span>
-
-<span class="sd">            * ``"precision"`` - We reduce floating point precision as follows:</span>
-<span class="sd">                * ``np.float128 -> np.float64``</span>
-<span class="sd">                * ``np.float96 -> np.float64``</span>
-<span class="sd">                * ``np.float64 -> np.float32``</span>
-
-<span class="sd">            * ``subsample`` - We subsample data such that it **fits directly into</span>
-<span class="sd">              the memory allocation** ``memory_allocation * memory_limit``.</span>
-<span class="sd">              Therefore, this should likely be the last method listed in</span>
-<span class="sd">              ``"methods"``.</span>
-<span class="sd">              Subsampling takes into account classification labels and stratifies</span>
-<span class="sd">              accordingly. We guarantee that at least one occurrence of each</span>
-<span class="sd">              label is included in the sampled set.</span>
+<span class="sd">        **memory_allocation**</span>
+
+<span class="sd">        By default, we attempt to fit the dataset into ``0.1 * memory_limit``. This</span>
+<span class="sd">        float value can be set with ``"memory_allocation": 0.1``. We also allow for</span>
+<span class="sd">        specifying absolute memory in MB, e.g. 10MB is ``"memory_allocation": 10``.</span>
+
+<span class="sd">        The memory used by the dataset is checked after each reduction method is</span>
+<span class="sd">        performed. If the dataset fits into the allocated memory, any further methods</span>
+<span class="sd">        listed in ``"methods"`` will not be performed.</span>
+
+<span class="sd">        For example, if ``methods: ["precision", "subsample"]`` and the</span>
+<span class="sd">        ``"precision"`` reduction step was enough to make the dataset fit into memory,</span>
+<span class="sd">        then the ``"subsample"`` reduction step will not be performed.</span>
+
+<span class="sd">        **methods**</span>
+
+<span class="sd">        We currently provide the following methods for reducing the dataset size.</span>
+<span class="sd">        These can be provided in a list and are performed in the order as given.</span>
+
+<span class="sd">        * ``"precision"`` - We reduce floating point precision as follows:</span>
+<span class="sd">            * ``np.float128 -> np.float64``</span>
+<span class="sd">            * ``np.float96 -> np.float64``</span>
+<span class="sd">            * ``np.float64 -> np.float32``</span>
+
+<span class="sd">        * ``subsample`` - We subsample data such that it **fits directly into the</span>
+<span class="sd">          memory allocation** ``memory_allocation * memory_limit``. Therefore, this</span>
+<span class="sd">          should likely be the last method listed in ``"methods"``.</span>
+<span class="sd">          Subsampling takes into account classification labels and stratifies</span>
+<span class="sd">          accordingly. We guarantee that at least one occurrence of each label is</span>
+<span class="sd">          included in the sampled set.</span>
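The ``dataset_compression`` hunk above describes a budget-and-stop loop: compute the allocation (a float is a fraction of ``memory_limit``, an int is absolute MB), then apply each method in ``"methods"`` order, skipping the rest once the dataset fits. A minimal sketch of that control flow, with hypothetical helper names (not auto-sklearn internals) and precision reduction modeled simply as halving the array size:

```python
# Illustrative sketch of the documented compression policy; not auto-sklearn code.
def allocated_mb(memory_limit_mb, memory_allocation=0.1):
    """Resolve the allocation: float = fraction of the limit, int = absolute MB."""
    if isinstance(memory_allocation, float):
        return memory_allocation * memory_limit_mb
    return float(memory_allocation)

def compress(dataset_mb, memory_limit_mb,
             methods=("precision", "subsample"), memory_allocation=0.1):
    """Apply reduction methods in order until the dataset fits the budget."""
    budget = allocated_mb(memory_limit_mb, memory_allocation)
    applied = []
    for method in methods:
        if dataset_mb <= budget:
            break  # dataset already fits; later methods are skipped
        if method == "precision":
            dataset_mb /= 2        # e.g. np.float64 -> np.float32 halves the size
        elif method == "subsample":
            dataset_mb = budget    # subsample so the data fits the allocation
        applied.append(method)
    return dataset_mb, applied
```

With the defaults (``memory_allocation=0.1``), a 500 MB array under a 3000 MB limit gets a 300 MB budget: ``"precision"`` brings it to 250 MB and ``"subsample"`` is never run, matching the example in the docstring. The stratification guarantee for ``"subsample"`` is a property of the real implementation and is not modeled here.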

 <span class="sd">    Attributes</span>
 <span class="sd">    ----------</span>