
auto-sklearn produces probability matrix inconsistent with training input #1190


Closed
PGijsbers opened this issue Jul 27, 2021 · 3 comments

@PGijsbers

Describe the bug

When the dataset contains outlier classes (classes with very few samples) and is large enough to be subsampled, auto-sklearn can produce a probability matrix with fewer columns than there are classes in the training data.

To Reproduce

import numpy as np
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

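# Large enough to trigger subsampling under the 10_000 MB memory limit.
x = np.random.random(size=(60_000_017, 10))
# 19 classes in total: two large classes (1 and 2) plus 17 classes (3..19) with one sample each.
y = np.asarray([1]*30_000_000 + [2]*30_000_000 + list(range(3,20)))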

aml = AutoSklearn2Classifier(time_left_for_this_task=60, memory_limit=10_000)
aml.fit(x, y)
predictions = aml.predict(x)
probabilities = aml.predict_proba(x)

print(probabilities.shape)

(60000017, 5)

Alternatively, and much more slowly, it can be reproduced with the automl benchmark on KDDCup:

python runbenchmark.py autosklearn2:latest openml/t/360112 1h8c -f 5 -m docker -s force

Expected behavior

The number of columns in the probability matrix should match the number of classes in the training data.

(60000017, 19)

Alternatively, provide a way to tell which column belongs to which class and which classes have no predictions.
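As a user-side illustration of that column-to-class mapping, here is a minimal sketch, assuming the caller can somehow obtain the class labels the fitted model actually saw. The helper `expand_proba` and its `seen_classes` argument are hypothetical and not part of the auto-sklearn API; whether auto-sklearn exposes such a list is exactly what this issue asks about.

```python
import numpy as np


def expand_proba(proba, seen_classes, all_classes):
    """Hypothetical helper: pad `proba` to one column per class in `all_classes`.

    Column j of `proba` is assumed to belong to `seen_classes[j]`.
    Classes the model never saw get a probability of zero.
    """
    proba = np.asarray(proba)
    seen_classes = np.asarray(seen_classes)
    sorted_all = np.sort(np.asarray(all_classes))
    if proba.shape[1] != len(seen_classes):
        raise ValueError("proba needs exactly one column per seen class")
    if not np.isin(seen_classes, sorted_all).all():
        raise ValueError("seen_classes must be a subset of all_classes")
    # Position of each seen class within the sorted full class list.
    positions = np.searchsorted(sorted_all, seen_classes)
    full = np.zeros((proba.shape[0], len(sorted_all)), dtype=proba.dtype)
    full[:, positions] = proba
    return full


# Toy example: the model only saw classes 1 and 2 out of 1..19.
proba = np.array([[0.7, 0.3], [0.2, 0.8]])
full = expand_proba(proba, seen_classes=[1, 2], all_classes=range(1, 20))
print(full.shape)
```

This only restores the expected shape; it cannot recover predictions for classes that were dropped before training.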

Actual behavior, stacktrace or logfile

(venv) root@486c0ae472af:/bench# python mwe.py
[WARNING] [2021-07-27 16:19:41,000:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing the precision from float64 to <class 'numpy.float32'>
[WARNING] [2021-07-27 16:19:42,210:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Dataset too large for memory limit 10000MB, reducing number of samples from 60000017 to 13107200.
[WARNING] [2021-07-27 16:19:45,795:Client-AutoML(1):6d574018-eef6-11eb-9953-0242ac110004] Could not sample dataset in stratified manner, resorting to random sampling
Traceback (most recent call last):
  File "/bench/frameworks/autosklearn/lib/auto-sklearn/autosklearn/automl.py", line 940, in subsample_if_too_large
    stratify=y,
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 2197, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1387, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1715, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
/bench/frameworks/autosklearn/venv/lib/python3.7/site-packages/smac/intensification/parallel_scheduling.py:152: UserWarning: SuccessiveHalving is intended to be used with more than 1 worker but num_workers=1
  num_workers
(60000017, 5)
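The log shows stratified sampling failing on the single-member classes and the fallback to random sampling. The following scaled-down sketch illustrates why a plain random subsample can then lose classes. It uses NumPy's `Generator.choice` as a stand-in for the sampling step, which is an assumption for illustration and not auto-sklearn's actual code path; the sizes are also reduced so it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same class structure as the report, scaled down:
# two large classes plus 17 classes with a single sample each.
y = np.asarray([1] * 3000 + [2] * 3000 + list(range(3, 20)))

# Random (non-stratified) subsample of roughly 10% of the rows.
keep = rng.choice(len(y), size=600, replace=False)
y_sub = y[keep]

all_classes = np.unique(y)
kept_classes = np.unique(y_sub)
missing = np.setdiff1d(all_classes, kept_classes)

print("classes in full data:", len(all_classes))
print("classes after subsampling:", len(kept_classes))
print("classes lost:", missing)
```

A model trained on `y_sub` has no way to know about the lost classes, which is consistent with the narrower probability matrix reported above.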

Environment and installation:

Please give details about your installation:

  • OS: Debian 10 in docker hosted by Windows 10
  • virtual environment
  • Python version: 3.7.11
  • Auto-sklearn version: development (11afae22b8c9a6309d2b6fcf7cfb9a947711cd1e)
@eddiebergman

eddiebergman commented Aug 11, 2021

Hi @PGijsbers ,

Just letting you know that this is addressed in PR #1218, and your error log was very helpful in diagnosing it. It also sheds light on some other potential areas of concern regarding outliers.

@PGijsbers

Glad to hear I could be of help :)

@eddiebergman

Closed as merged with PR #1218
