Description
I'm trying to take advantage of the datetime functionality presented here https://openscoring.io/blog/2020/03/08/sklearn_date_datetime_pmml/, which works great for datetime fields that are always populated.
For each sample in my data I have the datetime the sample was created, plus a historic datetime for a related event that may or may not have happened. I would like to calculate a feature that is the difference between these two timestamps when both are present, but null when the historic event hasn't happened.
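In plain pandas terms (with made-up example values), the feature I'm after is something like this:

import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-06-01 12:00:00", "2021-06-02 08:30:00"]),
    "historic_event": pd.to_datetime(["2021-05-01 12:00:00", None]),  # second sample: event never happened
})

# day difference between the creation time and the historic event;
# rows where the event never happened (NaT) should come out as NaN/null
days_since_historic_event = (df["datetime"] - df["historic_event"]).dt.total_seconds() / (60 * 60 * 24)
print(days_since_historic_event.tolist())  # [31.0, nan]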
I'm currently using this mapper config:
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.cross_reference import Memory, make_memorizer_union, make_recaller_union
from sklearn2pmml.decoration import Alias, DateTimeDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, SecondsSinceMidnightTransformer, SecondsSinceYearTransformer

def duration_transformer():
    return ExpressionTransformer("(X[0] - X[1]) / (60 * 60 * 24)", dtype=float)

memory = Memory()

# make_hour_of_day_transformer() is a small helper along the lines of the linked blog post
mapper = DataFrameMapper([
    # "datetime" (sample creation time): memorize the raw timestamp for re-use below, then derive the hour of day
    (["datetime"], [DateTimeDomain(), make_memorizer_union(memory, names=["memorized_datetime"]), SecondsSinceMidnightTransformer(), Alias(make_hour_of_day_transformer(), "HourOfDay", prefit=False)], {"alias": "hour_of_day"}),
    # "historic_event": recall the memorized creation timestamp and compute the day difference between the two
    (["historic_event"], [DateTimeDomain(), make_recaller_union(memory, names=["memorized_datetime"]), SecondsSinceYearTransformer(year=1900), Alias(duration_transformer(), "days_since_historic_event", prefit=False)], {"alias": "days_since_historic_event"}),
], input_df=False, df_out=True)
When I attempt to fit_transform, I get an error because SecondsSinceYearTransformer receives NaT values for the rows where the historic event hasn't happened, and the underlying DurationTransformer class tries to cast whatever it gets to int, which fails:
IntCastingNaNError: ['historic_event']: Cannot convert non-finite values (NA or inf) to integer
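For example, any input with a missing historic_event reproduces it (using the illustrative frame from the start of this post):

# the second row has NaT in "historic_event"
mapper.fit_transform(df)  # raises the IntCastingNaNError above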
Is there a functional reason why SecondsSinceYearTransformer doesn't have missing/invalid value treatment options like other transformers? Ideally I'd be able to tell it to just pass missing values through and return a null that LGBM is capable of handling, although I assume I'd then also have to update my duration_transformer() to understand what to do with null values.
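For what it's worth, if SecondsSinceYearTransformer did pass missing values through, I imagine the updated duration_transformer() would look roughly like this; I'm assuming (not verified) that ExpressionTransformer expressions can use pandas.notnull() and an if/else here, and that a None result ends up as a missing value on the LGBM side:

def duration_transformer():
    # sketch only: guard the subtraction so a missing timestamp yields a missing result
    return ExpressionTransformer(
        "(X[0] - X[1]) / (60 * 60 * 24) if (pandas.notnull(X[0]) and pandas.notnull(X[1])) else None",
        dtype=float
    )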