PyPMML is making incorrect predictions for XGBoost pipeline #456

Open
@shivangi-ph

Hi,

I keep getting incorrect predictions when I run inference with the PMML version of my pipeline. The pipeline works correctly in Python with pipeline.predict_proba(). After I convert it to PMML (either with sklearn2pmml or by converting the pickle file from the command line) and load the result in Python with pypmml or jpmml-evaluator-python, the predictions no longer match the Python output.
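For reference, a minimal sketch of how the two sets of probabilities can be compared row by row to find where they diverge. The helper name, the tolerance, and the sample numbers are hypothetical and are not taken from my pipeline; the inputs are assumed to be plain lists of positive-class probabilities.

```python
import math

def find_mismatches(expected, actual, abs_tol=1e-6):
    """Return (row_index, expected, actual) for every row whose values differ."""
    if len(expected) != len(actual):
        raise ValueError("both prediction lists must have the same length")
    mismatches = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        # Absolute tolerance only, since probabilities live in [0, 1].
        if not math.isclose(e, a, rel_tol=0.0, abs_tol=abs_tol):
            mismatches.append((i, e, a))
    return mismatches

# Hypothetical probabilities, for illustration only.
python_proba = [0.10, 0.80, 0.35]
pmml_proba = [0.10, 0.75, 0.35]
print(find_mismatches(python_proba, pmml_proba))
```

Joining the mismatching row indices back to the input rows should show which raw values of x are affected.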

I have narrowed the issue down to the first step in my four-step pipeline, which is responsible for feature engineering. Here is my first step:

def num_exp(x):
    if (x >= 0.0 and x != 999.0) or x == -999999999.0 or x is None:
        return x
    else:
        return 999999999.0

def special_exp(x):
    if x < 0.0 or x == 999.0 or x is None:
        return x
    else:
        return 999999999.0

# Imports assumed from the class names used below.
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.preprocessing import ExpressionTransformer
from sklearn2pmml.util import Expression

def create_dataframe_mapper(dataFrame, special_value_cols, categorical_cols_raw):
    transformers = []
    for column in dataFrame.columns:
        if column in special_value_cols:
            special_expr = ExpressionTransformer(Expression("special_exp(X[0])", function_defs=[special_exp]))
            numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])", function_defs=[num_exp]))

            # Each special-value column yields two outputs: <column>_num and <column>_special_values.
            transformers.append(([column], [ContinuousDomain(dtype=float, data_min=-999999999, data_max=9999999999, missing_value_replacement=-999999999.0), numerical_expr], {"alias": f"{column}_num"}))
            transformers.append(([column], [ContinuousDomain(dtype=float, data_min=-999999999, data_max=9999999999, missing_value_replacement=-999999999.0), special_expr], {"alias": f"{column}_special_values"}))
        else:
            # Pass all other columns through unchanged.
            passthrough = ExpressionTransformer(expr="X[0]")
            transformers.append(([column], passthrough))

    # Build the mapper once, after all transformers have been collected.
    newcol_mapper = DataFrameMapper(transformers, df_out=True)
    return newcol_mapper, transformers



Specifically, the first step takes one column, x, and generates two new columns from it: x_num and x_special_values. The intended logic is listed below, followed by a table with example values:

x_num: If x >= 0 and x != 999, return x. Else, return 999999999. Leave missing values unchanged.
x_special_values: If x < 0 or x == 999, return x. Else, return 999999999. Leave missing values unchanged.

x      x_num      x_special_values
1.0    1.0        999999999
2.0    2.0        999999999
-1.0   999999999  -1.0
NaN    NaN        NaN
999    999999999  999
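The two rules can also be written as a plain-Python reference. This is a sketch of the intended behaviour as described in the rules, not the code used in the pipeline; the names SENTINEL, is_missing, expected_num and expected_special are made up for illustration.

```python
import math

SENTINEL = 999999999.0

def is_missing(x):
    # Treat both None and NaN as missing.
    return x is None or (isinstance(x, float) and math.isnan(x))

def expected_num(x):
    if is_missing(x):
        return x  # leave missing values unchanged
    return x if (x >= 0.0 and x != 999.0) else SENTINEL

def expected_special(x):
    if is_missing(x):
        return x  # leave missing values unchanged
    return x if (x < 0.0 or x == 999.0) else SENTINEL

# Same input values as the example table.
for x in [1.0, 2.0, -1.0, float("nan"), 999.0]:
    print(x, expected_num(x), expected_special(x))
```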

For the else branch, where columns are left unchanged, I wanted to stay within DataFrameMapper, so as a workaround I used an ExpressionTransformer that returns the column values as they are.

I have also tried the following approach to generate the new columns, which doesn't work either.

special_expr = ExpressionTransformer(Expression("special_exp(X[0])", function_defs=[special_exp]), map_missing_to=-999999999.0)
numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])", function_defs=[num_exp]), map_missing_to=-999999999.0)

transformers.append(([column], numerical_expr, {"alias": f"{column}_num"}))
transformers.append(([column], special_expr, {"alias": f"{column}_special_values"}))

I also tried using an inline expression string instead of a UDF, but that doesn't work either.

This is an example of one of the columns in the PMML file, in case it helps. I am not very familiar with PMML, so I am not sure whether it is working as it is supposed to.

<DerivedField name="x_num" optype="continuous" dataType="double">
	<Apply function="if">
		<Apply function="and">
			<Apply function="greaterOrEqual">
				<FieldRef field="x"/>
				<Constant dataType="double">0.0</Constant>
			</Apply>
			<Apply function="notEqual">
				<FieldRef field="x"/>
				<Constant dataType="double">999.0</Constant>
			</Apply>
		</Apply>
		<FieldRef field="x"/>
		<Apply function="if">
			<Apply function="equal">
				<FieldRef field="x"/>
				<Constant missing="true"/>
			</Apply>
			<Constant dataType="double">-9.99999999E8</Constant>
			<Constant dataType="double">9.99999999E8</Constant>
		</Apply>
	</Apply>
</DerivedField>
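To make the nested Apply elements easier to read, here is a small sketch that prints the same excerpt as a single-line expression using only the Python standard library. It does not evaluate anything; it only rearranges the XML. The render helper is hypothetical, and the closing DerivedField tag is included in the embedded string so that the snippet parses.

```python
import xml.etree.ElementTree as ET

PMML_SNIPPET = """
<DerivedField name="x_num" optype="continuous" dataType="double">
  <Apply function="if">
    <Apply function="and">
      <Apply function="greaterOrEqual">
        <FieldRef field="x"/>
        <Constant dataType="double">0.0</Constant>
      </Apply>
      <Apply function="notEqual">
        <FieldRef field="x"/>
        <Constant dataType="double">999.0</Constant>
      </Apply>
    </Apply>
    <FieldRef field="x"/>
    <Apply function="if">
      <Apply function="equal">
        <FieldRef field="x"/>
        <Constant missing="true"/>
      </Apply>
      <Constant dataType="double">-9.99999999E8</Constant>
      <Constant dataType="double">9.99999999E8</Constant>
    </Apply>
  </Apply>
</DerivedField>
"""

def render(node):
    """Render a FieldRef, Constant, or Apply element as function-call text."""
    if node.tag == "FieldRef":
        return node.get("field")
    if node.tag == "Constant":
        if node.get("missing") == "true":
            return "MISSING"
        return node.text
    if node.tag == "Apply":
        args = ", ".join(render(child) for child in node)
        return f"{node.get('function')}({args})"
    raise ValueError(f"unexpected element: {node.tag}")

derived_field = ET.fromstring(PMML_SNIPPET.strip())
expression = render(derived_field[0])  # the single top-level Apply
print(f"{derived_field.get('name')} = {expression}")
```

Read this way, the excerpt tests x >= 0 and x != 999 first, and only the inner if compares x against a missing constant.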

The rest of my pipeline has two more DataFrameMapper steps, which use the columns generated in the first step, followed by the classifier that produces the predictions, as shown below.

initial_mapper, transfomrer1 = create_dataframe_mapper(df_train_pipeline, special_value_cols, categorical_cols_raw)
second_mapper, transfomrer2 = feature_eng(fteng_cols, all_cols)
final_mapper, transformer4 = create_final_mapper(binning_process, order_cols, all_num_cols2, all_cat_cols2, categorical_variables)
pipeline_mod = PMMLPipeline([
    ("mapper", initial_mapper),
    ("second_mapper", second_mapper),
    ("final_mapper", final_mapper),
    ("classifier", model),
])

My question is: where are things going wrong in the PMML? When I leave the first step out of my pipeline, the PMML and Python predictions match. I can also see that the output of the first step is correct in Python. Is there a known issue with how sklearn2pmml translates this transformation logic? Any guidance or fix would be greatly appreciated.

Environment:
sklearn2pmml version: 0.112.1.post1
pypmml version: 1.5.6
Python version: 3.12.7
