auto-sklearn produces probability matrix inconsistent with training input #1190

PGijsbers · 2021-07-27T16:26:06Z

Describe the bug

When the dataset has outliers and is big enough to be subsampled, auto-sklearn can produce a probability matrix with fewer columns than there are classes in the training data.

To Reproduce

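import numpy as np
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

# Two large classes (1 and 2) plus 17 classes (3..19) with a single sample each.
x = np.random.random(size=(60_000_017, 10))
y = np.asarray([1]*30_000_000 + [2]*30_000_000 + list(range(3,20)))

# The memory limit (in MB) is low enough to force subsampling of the dataset.
aml = AutoSklearn2Classifier(time_left_for_this_task=60, memory_limit=10_000)
aml.fit(x, y)
predictions = aml.predict(x)
probabilities = aml.predict_proba(x)

# Expected one column per class (19); see the actual output below.
print(probabilities.shape)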

(60000017, 5)

Alternatively, and much more slowly, it can be reproduced with the automl benchmark on KDDCup:

python runbenchmark.py autosklearn2:latest openml/t/360112 1h8c -f 5 -m docker -s force

Expected behavior

The number of columns in the probability matrix should match the number of classes in the training data.

(60000017, 19)

Alternatively, there should be a way to tell which column belongs to which class, and for which classes no predictions have been made.
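As a rough illustration of that second option: if the class label for each returned column were known, the matrix could be padded back to the full set of training classes. The sketch below is not auto-sklearn code. The helper name `expand_proba` and the `model_classes` argument are hypothetical, and it assumes the full class list is sorted and that the caller somehow knows which classes the model kept.

```python
import numpy as np

def expand_proba(proba, model_classes, all_classes):
    """Pad a probability matrix so it has one column per class in all_classes.

    Hypothetical helper: assumes all_classes is sorted and that
    model_classes lists the class label of each column in proba.
    """
    all_classes = np.asarray(all_classes)
    model_classes = np.asarray(model_classes)
    full = np.zeros((proba.shape[0], len(all_classes)), dtype=proba.dtype)
    # Column position of each class known to the model within the full class list.
    col_idx = np.searchsorted(all_classes, model_classes)
    full[:, col_idx] = proba
    return full

# Made-up example: the model only kept classes 1, 2 and 7 out of 1..19.
all_classes = np.arange(1, 20)
model_classes = np.array([1, 2, 7])
proba = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.7, 0.1]])
full = expand_proba(proba, model_classes, all_classes)
print(full.shape)  # (2, 19)
```

Classes the model never saw simply get probability 0 in every row, so each row still sums to 1.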

Actual behavior, stacktrace or logfile

(venv) root@486c0ae472af:/bench# python mwe.py
[WARNING] [2021-07-27 16:19:41,000:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing the precision from float64 to <class 'numpy.float32'>
[WARNING] [2021-07-27 16:19:42,210:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing number of samples from 60000017 to 13107200.
[WARNING] [2021-07-27 16:19:45,795:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Could not sample dataset in stratified manner, resorting to random sampling
Traceback (most recent call last):
  File "/bench/frameworks/autosklearn/lib/auto-sklearn/autosklearn/automl.py", line 940, in subsample_if_too_large
    stratify=y,
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 2197, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1387, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1715, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/smac/intensification/parallel_scheduling.py:152: UserWarning: SuccessiveHalving is intended to be used with more than 1 worker but num_workers=1
  num_workers
(60000017, 5)
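The log suggests a likely mechanism: stratified sampling fails because of the single-member classes, and the random-sampling fallback then drops some of those classes entirely. The following is a scaled-down sketch of that effect using only NumPy. It does not call auto-sklearn, the sizes are reduced by a factor of 10,000, and the seed is arbitrary, so the exact number of surviving classes will differ from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same label layout as the report, scaled down by a factor of 10,000.
y = np.asarray([1] * 3_000 + [2] * 3_000 + list(range(3, 20)))

# Keep roughly the same fraction of rows as in the log (13107200 of 60000017).
n_keep = int(len(y) * 13_107_200 / 60_000_017)

# Plain random sampling without replacement, ignoring class membership.
idx = rng.choice(len(y), size=n_keep, replace=False)
surviving = np.unique(y[idx])

print(len(np.unique(y)), "classes before subsampling")
print(len(surviving), "classes after subsampling")
```

Each single-member class survives only if its one row happens to be drawn, which here happens with probability of roughly 0.22, so most of them are expected to vanish from the training subset.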

Environment and installation:

Please give details about your installation:

OS: Debian 10 in docker hosted by Windows 10
virtual environment
Python version: 3.7.11
Auto-sklearn version: development (11afae22b8c9a6309d2b6fcf7cfb9a947711cd1e)

eddiebergman · 2021-08-11T21:46:51Z

Hi @PGijsbers ,

Just letting you know this is addressed in PR #1218, and your error log was very helpful in diagnosing it. It also sheds light on some other potential areas of concern regarding outliers.

PGijsbers · 2021-08-16T09:23:58Z

Glad to hear I could be of help :)

eddiebergman · 2021-09-03T10:43:20Z

Closed as merged with PR #1218

eddiebergman added the bug label Jul 27, 2021

eddiebergman mentioned this issue Aug 11, 2021

Fix probability matrix inconsistent with training input #1218

Merged

eddiebergman closed this as completed Sep 3, 2021

eddiebergman mentioned this issue Apr 24, 2022

Is there another method of finding the best algorithm other than leaderboard #1442

Open
