
Support for customizing missing/invalid value handling across all custom Transformer classes (similar to what's already available in ExpressionTransformer) #438

Open
@philip-bingham

I'm trying to take advantage of the datetime functionality described at https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/, which works great for datetime fields that are always populated.

For each sample in my data I have the datetime the sample was created, then a historic datetime for an event related to this sample that may or may not have happened. I would like to calculate a feature that is the difference between these timestamps if both are present, but null if the historic event hasn't happened.
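The intended feature can be sketched in plain pandas, outside of any sklearn2pmml pipeline. This is only an illustration of the desired semantics: the column names follow the ones used in this issue, the sample values are made up, and the subtraction order (creation time minus historic event) is an assumption.

```python
import pandas as pd

# Made-up sample data: the second row has no historic event (NaT)
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-10 12:00:00", "2024-01-11 08:30:00"]),
    "historic_event": pd.to_datetime(["2024-01-08 12:00:00", None]),
})

# Timestamp subtraction propagates NaT, and total_seconds() turns it into NaN
delta = df["datetime"] - df["historic_event"]
df["days_since_historic_event"] = delta.dt.total_seconds() / (60 * 60 * 24)

print(df["days_since_historic_event"].tolist())
```

The first row yields a difference in days and the second row stays missing, which is the kind of null that a gradient boosting model with native missing value support can consume.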

I'm currently using this mapper config:

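from sklearn_pandas import DataFrameMapper
from sklearn2pmml.cross_reference import Memory, make_memorizer_union, make_recaller_union
from sklearn2pmml.decoration import Alias, DateTimeDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, SecondsSinceMidnightTransformer, SecondsSinceYearTransformer

# make_hour_of_day_transformer() is defined elsewhere (not shown here)

def duration_transformer():
    # Difference between the two "seconds since year" columns, converted from seconds to days
    return ExpressionTransformer("(X[0] - X[1]) / (60 * 60 * 24)", dtype=float)

memory = Memory()

mapper = DataFrameMapper([
    # Memorize the parsed creation datetime, then derive the hour of day
    (["datetime"], [DateTimeDomain(), make_memorizer_union(memory, names=["memorized_datetime"]), SecondsSinceMidnightTransformer(), Alias(make_hour_of_day_transformer(), "HourOfDay", prefit=False)], {"alias": "hour_of_day"}),
    # Recall the creation datetime alongside the historic event, then compute the gap in days
    (["historic_event"], [DateTimeDomain(), make_recaller_union(memory, names=["memorized_datetime"]), SecondsSinceYearTransformer(year=1900), Alias(duration_transformer(), "days_since_historic_event", prefit=False)], {"alias": "days_since_historic_event"}),
], input_df=False, df_out=True)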

When I call fit_transform, I get an error: the SecondsSinceYearTransformer receives some NaT values, and the DurationTransformer class attempts to cast whatever value it gets to int, which fails:

IntCastingNaNError: ['historic_event']: Cannot convert non-finite values (NA or inf) to integer
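The same kind of failure can be reproduced with plain pandas. This is a sketch of the assumed mechanism only (an integer cast applied to values that contain NaN), not the actual sklearn2pmml code; the sample values are made up.

```python
import pandas as pd

# Made-up durations in seconds; NaN stands in for a missing historic event
seconds = pd.Series([172800.0, float("nan")])

try:
    seconds.astype(int)
    cast_error = None
except ValueError as exc:  # pandas' IntCastingNaNError is a ValueError subclass
    cast_error = exc

print(type(cast_error).__name__, cast_error)

# Keeping the values as float preserves the missing entry as NaN
as_days = seconds.astype(float) / (60 * 60 * 24)
print(as_days.tolist())
```

An integer column has no representation for a missing value, so the cast has to fail, whereas a float column can carry the missing entry through as NaN.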

Is there a functional reason why the SecondsSinceYearTransformer doesn't have missing/invalid value treatment options like other transformers? Ideally I'd be able to tell it to pass missing values through and return a null that LGBM is capable of handling, although I assume I'd then have to update my duration_transformer() to handle null values as well.
