Description
Hi,
I keep getting incorrect predictions when I run inference with the PMML version of my pipeline. The pipeline works correctly in Python with pipeline.predict_proba(), but after I convert it to PMML (either with sklearn2pmml or by converting the pickle file from the command line) and load it for prediction with pypmml or jpmml-evaluator-python, the predictions do not match the Python output.
I have narrowed the issue down to the first step in my four-step pipeline, which is responsible for feature engineering. Here is my first step:
# Imports assumed from context (sklearn-pandas and sklearn2pmml)
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.preprocessing import ExpressionTransformer
from sklearn2pmml.util import Expression

def num_exp(x):
    if (x >= 0.0 and x != 999.0) or x == -999999999.0 or x is None:
        return x
    else:
        return 999999999.0

def special_exp(x):
    if (x < 0.0 or x == 999.0 or x is None):
        return x
    else:
        return 999999999.0

def create_dataframe_mapper(dataFrame, special_value_cols, categorical_cols_raw):
    transformers = []
    for column in dataFrame.columns:
        if column in special_value_cols:
            # Two derived columns per special-value column: <column>_num and <column>_special_values
            special_expr = ExpressionTransformer(Expression("special_exp(X[0])", function_defs = [special_exp]))
            numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])", function_defs = [num_exp]))
            transformers.append(([column], [ContinuousDomain(dtype = float, data_min = -999999999, data_max = 9999999999, missing_value_replacement = -999999999.0), numerical_expr], {"alias": f"{column}_num"}))
            transformers.append(([column], [ContinuousDomain(dtype = float, data_min = -999999999, data_max = 9999999999, missing_value_replacement = -999999999.0), special_expr], {"alias": f"{column}_special_values"}))
        else:
            # Workaround: pass the column through unchanged
            passthrough = ExpressionTransformer(expr="X[0]")
            transformers.append(([column], passthrough))
    newcol_mapper = DataFrameMapper(transformers, df_out=True)
    return newcol_mapper, transformers
Specifically, the first step takes one column, x, and generates two new columns from it: x_num and x_special_values. The intended logic is listed below, followed by a table showing how it should behave:
x_num: If x >= 0 and x != 999, return x. Else, return 999999999. Leave missing values unchanged.
x_special_values: If x < 0 or x == 999, return x. Else, return 999999999. Leave missing values unchanged.
| x | x_num | x_special_values |
|---|---|---|
| 1.0 | 1.0 | 999999999 |
| 2.0 | 2.0 | 999999999 |
| -1.0 | 999999999 | -1.0 |
| NaN | NaN | NaN |
| 999 | 999999999 | 999 |
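To make the intended behaviour concrete, here is a minimal plain-Python sketch of the two rules that reproduces the rows of the table. It is independent of sklearn2pmml; the names `expected_num`, `expected_special` and `SENTINEL` are hypothetical and only used for illustration, and missing values are assumed to arrive as NaN.

```python
import math

SENTINEL = 999999999.0  # placeholder value used in the table (hypothetical name)

def expected_num(x):
    """x_num rule: keep x when x >= 0 and x != 999, otherwise return the placeholder."""
    if math.isnan(x):  # missing values are left unchanged
        return x
    return x if (x >= 0.0 and x != 999.0) else SENTINEL

def expected_special(x):
    """x_special_values rule: keep x when x < 0 or x == 999, otherwise return the placeholder."""
    if math.isnan(x):  # missing values are left unchanged
        return x
    return x if (x < 0.0 or x == 999.0) else SENTINEL

# Same inputs as the table rows
inputs = [1.0, 2.0, -1.0, float("nan"), 999.0]
table = [(x, expected_num(x), expected_special(x)) for x in inputs]
for row in table:
    print(row)
```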
For the else branch, where columns are left unchanged, I wanted to keep everything inside the DataFrameMapper, so as a workaround I use an ExpressionTransformer that simply returns the column values as they are.
I have also tried the following approach to generate the new columns, which doesn't work either.
special_expr = ExpressionTransformer(Expression("special_exp(X[0])",function_defs = [special_exp]),map_missing_to = -999999999.0)
numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])",function_defs = [num_exp]),map_missing_to = -999999999.0)
transformers.append(([column], numerical_expr, {"alias":f"{column}_num"}))
transformers.append(([column], special_expr, {"alias":f"{column}_special_values"}))
I also tried using an inline expression string instead of a UDF, but that doesn't work either.
Here is one of the derived columns from the PMML file, in case it helps. I am not very familiar with PMML, so I am not sure whether it is working as it is supposed to.
<DerivedField name="x_num" optype="continuous" dataType="double">
  <Apply function="if">
    <Apply function="and">
      <Apply function="greaterOrEqual">
        <FieldRef field="x"/>
        <Constant dataType="double">0.0</Constant>
      </Apply>
      <Apply function="notEqual">
        <FieldRef field="x"/>
        <Constant dataType="double">999.0</Constant>
      </Apply>
    </Apply>
    <FieldRef field="x"/>
    <Apply function="if">
      <Apply function="equal">
        <FieldRef field="x"/>
        <Constant missing="true"/>
      </Apply>
      <Constant dataType="double">-9.99999999E8</Constant>
      <Constant dataType="double">9.99999999E8</Constant>
    </Apply>
  </Apply>
</DerivedField>
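As far as I can tell, reading the DerivedField above literally gives roughly the following Python. This is a hand translation for readability, not output from any tool. `MISSING` and `x_num_as_written` are hypothetical names, and the sketch does not model how an actual PMML engine evaluates a comparison whose input is missing.

```python
MISSING = None  # hypothetical stand-in for a PMML missing value

def x_num_as_written(x):
    """Literal reading of the nested <Apply function="if"> elements."""
    # Outer "if": greaterOrEqual(x, 0.0) and notEqual(x, 999.0) -> return x
    if x is not MISSING and x >= 0.0 and x != 999.0:
        return x
    # Inner "if": equal(x, missing) -> -9.99999999E8, otherwise 9.99999999E8
    if x is MISSING:
        return -999999999.0
    return 999999999.0

for value in [1.0, -1.0, 999.0, MISSING, -999999999.0]:
    print(value, "->", x_num_as_written(value))
```

Under this reading, an input of -999999999.0 maps to 999999999.0, whereas num_exp as defined above returns -999999999.0 unchanged because of its `x == -999999999.0` branch, so the two do not look equivalent for that value.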
The rest of my pipeline has two more DataFrameMapper steps that consume the columns generated in the first step, followed by the classifier that makes the predictions, as shown below.
initial_mapper, transformer1 = create_dataframe_mapper(df_train_pipeline, special_value_cols, categorical_cols_raw)
second_mapper, transformer2 = feature_eng(fteng_cols, all_cols)
final_mapper, transformer4 = create_final_mapper(binning_process, order_cols, all_num_cols2, all_cat_cols2, categorical_variables)
pipeline_mod = PMMLPipeline([
    ("mapper", initial_mapper),
    ("second_mapper", second_mapper),
    ("final_mapper", final_mapper),
    ("classifier", model)
])
My question is: where are things going wrong in the PMML? When I leave the first step out of the pipeline, the PMML and Python predictions match, and I can see that the output of the first step is correct in Python. Is there a known issue with how sklearn2pmml translates this transformation logic? Any guidance or fix would be greatly appreciated.
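For anyone trying to reproduce this, a row-by-row comparison along the following lines can pinpoint which inputs diverge. It is only a sketch: `find_mismatches` is a hypothetical helper, it calls neither sklearn2pmml nor pypmml, and the probabilities below are made-up numbers standing in for one column of pipeline.predict_proba() and the matching output of the PMML evaluator.

```python
import math

def find_mismatches(python_probs, pmml_probs, tol=1e-9):
    """Return (row_index, python_value, pmml_value) for every row that differs."""
    if len(python_probs) != len(pmml_probs):
        raise ValueError("both inputs must have the same number of rows")
    mismatches = []
    for i, (p, q) in enumerate(zip(python_probs, pmml_probs)):
        both_nan = math.isnan(p) and math.isnan(q)  # NaN on both sides counts as a match
        if not both_nan and not math.isclose(p, q, rel_tol=0.0, abs_tol=tol):
            mismatches.append((i, p, q))
    return mismatches

# Made-up values purely for illustration
python_probs = [0.10, 0.85, 0.40, float("nan")]
pmml_probs = [0.10, 0.85, 0.55, float("nan")]
result = find_mismatches(python_probs, pmml_probs)
print(result)
```

Looking up the input rows at the reported indices then shows whether the mismatches cluster on missing values, negative values, or 999.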
Environment:
sklearn2pmml version: 0.112.1.post1
pypmml version: 1.5.6
Python version: 3.12.7