
Commit 322cdf3 (parent 4b21321)

Doc: Adds documentation for the dataset compression argument from #1341 and #1250 (#1386)

* Add: Doc for `dataset_compression`
* Fix: Shorten line
* Doc: Make more clear that the argument None still provides defaults

File tree

4 files changed, +123 −88 lines


autosklearn/estimators.py (+68 −87)

```diff
@@ -156,58 +156,41 @@ def __init__(
         'feature_preprocessor': ["no_preprocessing"]
     }

-resampling_strategy : Union[str, BaseCrossValidator, _RepeatedSplits, BaseShuffleSplit] = "holdout"
+resampling_strategy : str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"
     How to handle overfitting; might need to use ``resampling_strategy_arguments``
     if using a ``"cv"`` based method or a Splitter object.
+
+    **Options**
+
+    * ``"holdout"`` - Use a 67:33 (train:test) split
+    * ``"cv"`` - Perform cross-validation, requires ``"folds"`` in ``resampling_strategy_arguments``
+    * ``"holdout-iterative-fit"`` - Same as ``"holdout"`` but iterative fit where possible
+    * ``"cv-iterative-fit"`` - Same as ``"cv"`` but iterative fit where possible
+    * ``"partial-cv"`` - Same as ``"cv"`` but uses intensification
+    * ``BaseCrossValidator`` - Any ``BaseCrossValidator`` subclass (found in the scikit-learn ``model_selection`` module)
+    * ``_RepeatedSplits`` - Any ``_RepeatedSplits`` subclass (found in the scikit-learn ``model_selection`` module)
+    * ``BaseShuffleSplit`` - Any ``BaseShuffleSplit`` subclass (found in the scikit-learn ``model_selection`` module)

     If using a Splitter object that relies on the dataset retaining its current
     size and order, you will need to look at the ``dataset_compression`` argument
     and ensure that ``"subsample"`` is not included in the applied compression
     ``"methods"``, or disable it entirely with ``False``.

-    **Options**
-
-    * ``"holdout"``:
-        67:33 (train:test) split
-    * ``"holdout-iterative-fit"``:
-        67:33 (train:test) split, iterative fit where possible
-    * ``"cv"``:
-        crossvalidation,
-        requires ``"folds"`` in ``resampling_strategy_arguments``
-    * ``"cv-iterative-fit"``:
-        crossvalidation,
-        calls iterative fit where possible,
-        requires ``"folds"`` in ``resampling_strategy_arguments``
-    * 'partial-cv':
-        crossvalidation with intensification,
-        requires ``"folds"`` in ``resampling_strategy_arguments``
-    * ``BaseCrossValidator`` subclass:
-        any BaseCrossValidator subclass (found in scikit-learn model_selection module)
-    * ``_RepeatedSplits`` subclass:
-        any _RepeatedSplits subclass (found in scikit-learn model_selection module)
-    * ``BaseShuffleSplit`` subclass:
-        any BaseShuffleSplit subclass (found in scikit-learn model_selection module)
-
-resampling_strategy_arguments : dict, optional if 'holdout' (train_size default=0.67)
-    Additional arguments for resampling_strategy:
-
-    * ``train_size`` should be between 0.0 and 1.0 and represent the
-      proportion of the dataset to include in the train split.
-    * ``shuffle`` determines whether the data is shuffled prior to
-      splitting it into train and validation.
-
-    Available arguments:
-
-    * 'holdout': {'train_size': float}
-    * 'holdout-iterative-fit': {'train_size': float}
-    * 'cv': {'folds': int}
-    * 'cv-iterative-fit': {'folds': int}
-    * 'partial-cv': {'folds': int, 'shuffle': bool}
-    * BaseCrossValidator or _RepeatedSplits or BaseShuffleSplit object: all arguments
-      required by chosen class as specified in scikit-learn documentation.
-      If arguments are not provided, scikit-learn defaults are used.
-      If no defaults are available, an exception is raised.
-      Refer to the 'n_splits' argument as 'folds'.
+resampling_strategy_arguments : Optional[Dict] = None
+    Additional arguments for ``resampling_strategy``; this is required if
+    using a ``cv`` based strategy. The default arguments if left as ``None``
+    are:
+
+    .. code-block:: python
+
+        {
+            "train_size": 0.67,  # The size of the training set
+            "shuffle": True,     # Whether to shuffle before splitting data
+            "folds": 5           # Used in 'cv' based resampling strategies
+        }
+
+    If using a custom splitter class which takes ``n_splits``, such as
+    `PredefinedSplit <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold>`_,
+    the value of ``"folds"`` will be used.

 tmp_folder : string, optional (None)
     folder to store configuration output and log files, if ``None``

@@ -219,12 +202,12 @@ def __init__(

 n_jobs : int, optional, experimental
     The number of jobs to run in parallel for ``fit()``. ``-1`` means
-        using all processors.
-
-        **Important notes**:
-
-        * By default, Auto-sklearn uses one core.
-        * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
+    using all processors.
+
+    **Important notes**:
+
+    * By default, Auto-sklearn uses one core.
+    * Ensemble building is not affected by ``n_jobs`` but can be controlled by the number
       of models in the ensemble.
     * ``predict()`` is not affected by ``n_jobs`` (in contrast to most scikit-learn models)
     * If ``dask_client`` is ``None``, a new dask client is created.

@@ -288,16 +271,14 @@ def __init__(

 dataset_compression: Union[bool, Mapping[str, Any]] = True
     We compress datasets so that they fit into some predefined amount of memory.
-    Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.
+    Currently this does not apply to dataframes or sparse arrays, only to raw
+    numpy arrays.

-    **NOTE**
-
-    If using a custom ``resampling_strategy`` that relies on specific
+    **NOTE** - If using a custom ``resampling_strategy`` that relies on a specific
     size or ordering of data, this must be disabled to preserve these properties.

-    You can disable this entirely by passing ``False``.
-
-    Default configuration when left as ``True``:
+    You can disable this entirely by passing ``False``, or leave the default
+    ``True`` for the configuration below.

     .. code-block:: python

@@ -311,36 +292,36 @@ def __init__(

     The available options are described here:

-    **memory_allocation**
-
-    By default, we attempt to fit the dataset into ``0.1 * memory_limit``. This
-    float value can be set with ``"memory_allocation": 0.1``. We also allow for
-    specifying absolute memory in MB, e.g. 10MB is ``"memory_allocation": 10``.
-
-    The memory used by the dataset is checked after each reduction method is
-    performed. If the dataset fits into the allocated memory, any further methods
-    listed in ``"methods"`` will not be performed.
-
-    For example, if ``methods: ["precision", "subsample"]`` and the
-    ``"precision"`` reduction step was enough to make the dataset fit into memory,
-    then the ``"subsample"`` reduction step will not be performed.
-
-    **methods**
-
-    We currently provide the following methods for reducing the dataset size.
-    These can be provided in a list and are performed in the order as given.
-
-    * ``"precision"`` - We reduce floating point precision as follows:
-        * ``np.float128 -> np.float64``
-        * ``np.float96 -> np.float64``
-        * ``np.float64 -> np.float32``
-
-    * ``subsample`` - We subsample data such that it **fits directly into the
-      memory allocation** ``memory_allocation * memory_limit``. Therefore, this
-      should likely be the last method listed in ``"methods"``.
-      Subsampling takes into account classification labels and stratifies
-      accordingly. We guarantee that at least one occurrence of each label is
-      included in the sampled set.
+    * **memory_allocation**
+        By default, we attempt to fit the dataset into ``0.1 * memory_limit``.
+        This float value can be set with ``"memory_allocation": 0.1``.
+        We also allow for specifying absolute memory in MB, e.g. 10MB is
+        ``"memory_allocation": 10``.
+
+        The memory used by the dataset is checked after each reduction method
+        is performed. If the dataset fits into the allocated memory, any further
+        methods listed in ``"methods"`` will not be performed.
+
+        For example, if ``methods: ["precision", "subsample"]`` and the
+        ``"precision"`` reduction step was enough to make the dataset fit into
+        memory, then the ``"subsample"`` reduction step will not be performed.
+
+    * **methods**
+        We provide the following methods for reducing the dataset size.
+        These can be provided in a list and are performed in the order given.
+
+        * ``"precision"`` - We reduce floating point precision as follows:
+            * ``np.float128 -> np.float64``
+            * ``np.float96 -> np.float64``
+            * ``np.float64 -> np.float32``
+
+        * ``"subsample"`` - We subsample data such that it **fits directly into
+          the memory allocation** ``memory_allocation * memory_limit``.
+          Therefore, this should likely be the last method listed in
+          ``"methods"``.
+          Subsampling takes into account classification labels and stratifies
+          accordingly. We guarantee that at least one occurrence of each
+          label is included in the sampled set.

 Attributes
 ----------
```
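The ``"precision"`` reduction documented in the diff above steps float dtypes down one level. The following is a hedged NumPy-only sketch of that idea, not auto-sklearn's internal implementation; the helper name ``reduce_precision`` is made up for illustration:

```python
import numpy as np

def reduce_precision(X: np.ndarray) -> np.ndarray:
    """Sketch of the 'precision' method: step float dtypes down one level."""
    mapping = {np.dtype(np.float64): np.float32}
    # np.float128 / np.float96 only exist on some platforms, so guard the lookup.
    if hasattr(np, "float128"):
        mapping[np.dtype(np.float128)] = np.float64
    # Leave any dtype we don't handle (e.g. integers) untouched.
    return X.astype(mapping.get(X.dtype, X.dtype))

X = np.random.rand(1000, 10)        # float64 by default
X_small = reduce_precision(X)
assert X_small.dtype == np.float32
assert X_small.nbytes * 2 == X.nbytes  # memory footprint halved
```

Halving the width of every float is why this step alone is often enough to bring a dataset under the ``memory_allocation`` budget, so ``"subsample"`` never runs.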

doc/conf.py (+1 −1)

```diff
@@ -198,7 +198,7 @@
     ('Start', 'index'),
     ('Releases', 'releases'),
     ('Installation', 'installation'),
-    #('Manual', 'manual'),
+    ('Manual', 'manual'),
     ('Examples', 'examples/index'),
     ('API', 'api'),
     ('Extending', 'extending'),
```

doc/faq.rst (+6 −0)

```diff
@@ -409,6 +409,12 @@ Configuring the Search Procedure

     Examples for using holdout and cross-validation can be found in :ref:`example <sphx_glr_examples_40_advanced_example_resampling.py>`

+    If using a custom resampling strategy with predefined splits, you may need to disable
+    the subsampling performed on particularly large datasets or when using a small ``memory_limit``.
+    Please see the manual section on :ref:`limits` and
+    :class:`AutoSklearnClassifier(dataset_compression=...) <autosklearn.classification.AutoSklearnClassifier>`
+    for more details.
+
 .. collapse:: <b>Can I use a custom metric</b>

     Examples for using a custom metric can be found in :ref:`example <sphx_glr_examples_40_advanced_example_metrics.py>`
```
doc/manual.rst (+48 −0)

```diff
@@ -45,6 +45,54 @@ tested.

     By default, *auto-sklearn* uses **one core**. See also :ref:`parallel` on how to configure this.

+
+.. collapse:: <b>Managing data compression</b>
+
+    .. _manual_managing_data_compression:
+
+    Auto-sklearn will attempt to fit the dataset into 1/10th of the ``memory_limit``.
+    This won't happen unless your dataset is quite large or you have a small
+    ``memory_limit``. This is done using two methods, reducing **precision** and
+    **subsampling**. You may want to control this if you require high
+    precision or you rely on predefined splits, which subsampling does not
+    account for.
+
+    To turn off dataset compression entirely:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression=False
+        )
+
+    You can specify which of the methods are performed using:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression={"methods": ["precision", "subsample"]},
+        )
+
+    You can change the memory allocation for the dataset to a percentage of ``memory_limit``
+    or an absolute amount using:
+
+    .. code:: python
+
+        AutoSklearnClassifier(
+            dataset_compression={"memory_allocation": 0.2},
+        )
+
+    The default arguments used when ``dataset_compression = True`` are:
+
+    .. code:: python
+
+        {
+            "memory_allocation": 0.1,
+            "methods": ["precision", "subsample"]
+        }
+
+    The full description is given at :class:`AutoSklearnClassifier(dataset_compression=...) <autosklearn.classification.AutoSklearnClassifier>`.
+
 .. _space:

 The search space
```
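The ``"subsample"`` method described in these docs stratifies by classification label and guarantees at least one occurrence of each label in the sampled set. Below is a rough NumPy-only sketch of that behaviour; ``stratified_subsample`` is an illustrative name, not auto-sklearn's actual code:

```python
import numpy as np

def stratified_subsample(X, y, n_samples, seed=0):
    """Pick n_samples rows, keeping at least one row per class label."""
    rng = np.random.default_rng(seed)
    # Start with one guaranteed example of every label...
    keep = [rng.choice(np.flatnonzero(y == label)) for label in np.unique(y)]
    # ...then fill the remainder uniformly from the rows not yet chosen.
    rest = np.setdiff1d(np.arange(len(y)), keep)
    extra = rng.choice(rest, size=n_samples - len(keep), replace=False)
    idx = np.concatenate([keep, extra])
    return X[idx], y[idx]

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)   # heavily imbalanced labels
Xs, ys = stratified_subsample(X, y, n_samples=20)
assert len(ys) == 20
assert set(np.unique(ys)) == {0, 1}  # rare class survives the subsample
```

Note how the guarantee matters for imbalanced data: a naive uniform subsample of 20 rows could drop class ``1`` entirely.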

0 commit comments
