Enabling numeric feature discretization #179

@alpetukhov

Dear ydf authors,

I'm running ydf version 0.8.0 (the latest available at the moment) on Windows 11 and am having trouble enabling discretization of numerical features in local (non-distributed) training.
The "How to train a model faster" page suggests that automatic discretization can be turned on for all features with discretize_numerical_columns=True. But when I pass it as an argument to GradientBoostedTreesLearner, I see no change in either the training speed or the model performance, even with num_discretized_numerical_bins=2. The ydf logs also report all features as NUMERICAL and none as DISCRETIZED_NUMERICAL.
The "How to define model features" page also suggests that ydf.Semantic.DISCRETIZED_NUMERICAL can be used to force discretization. However, if I pass a (feature name, semantic) tuple in the features option of GradientBoostedTreesLearner, I get the following error:

ValueError: Cannot import column 'XXX' with semantic=Semantic.DISCRETIZED_NUMERICAL, type=numpy's array of 'float64' and content=array(XXX)
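
For reference, the two attempts described above can be sketched as a single reproduction script. This is only a sketch under assumptions: the data is synthetic, the column names f1, f2 and label are placeholders rather than the original dataset, num_trees=20 is added only to keep the run short, and the use of model.input_features() to print the resolved semantics is an assumption about how one might inspect them. The script skips training if ydf is not installed, and the behaviour may differ between ydf versions.

```python
import numpy as np

try:
    import ydf  # Optional: the training part is skipped if ydf is missing.
except ImportError:
    ydf = None

# Synthetic stand-in data. The names f1, f2 and label are placeholders.
rng = np.random.default_rng(0)
n = 1000
data = {
    "f1": rng.normal(size=n),  # float64, like the column in the error message
    "f2": rng.normal(size=n),
}
data["label"] = (data["f1"] + data["f2"] > 0).astype(np.int64)

# Attempt 1: the global switch from "How to train a model faster".
attempt_1 = dict(
    label="label",
    discretize_numerical_columns=True,
    num_discretized_numerical_bins=2,
    num_trees=20,  # Not part of the report; only keeps the run short.
)

if ydf is not None:
    model = ydf.GradientBoostedTreesLearner(**attempt_1).train(data)
    # Print the semantic each input feature ended up with.
    for feature in model.input_features():
        print(feature.name, feature.semantic)

    # Attempt 2: force the semantic for one feature via the features option.
    attempt_2 = dict(
        label="label",
        features=[("f1", ydf.Semantic.DISCRETIZED_NUMERICAL)],
        num_trees=20,
    )
    try:
        ydf.GradientBoostedTreesLearner(**attempt_2).train(data)
    except ValueError as error:
        # In the setup described in this issue, the ValueError shows up here.
        print("Attempt 2 failed:", error)
```

If attempt 1 behaves as reported, the printed semantics would all be NUMERICAL despite the discretization flag.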

What is the correct way to turn on the on-the-fly discretization?

Also, the GradientBoostedTreesLearner API reference mentions a columns parameter, e.g. in the description of features:

"If include_all_columns=True, all the columns are imported as features and only the semantic of the columns NOT in columns is determined automatically"

However, this parameter is not described anywhere. Could you please explain what it is and what it is used for?

Best regards,
Aleksandr
