PyPMML is making incorrect predictions for XGBoost pipeline #456

Hi, 

I keep getting incorrect predictions when I use a PMML export of my pipeline for inference. The pipeline works correctly in Python with `pipeline.predict_proba()`. After converting it to PMML, either with `sklearn2pmml` or by converting the pickle file from the command line, and loading the result in Python with `pypmml` or `jpmml-evaluator-python`, the predictions no longer match the Python output.

I have narrowed the issue down to the first step in my four-step pipeline, which is responsible for feature engineering. Here is my first step:

```
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.preprocessing import ExpressionTransformer
from sklearn2pmml.util import Expression

def num_exp(x):
    if (x >= 0.0 and x != 999.0) or x == -999999999.0 or x is None:
        return x
    else:
        return 999999999.0

def special_exp(x):
    if x < 0.0 or x == 999.0 or x is None:
        return x
    else:
        return 999999999.0

def create_dataframe_mapper(dataFrame, special_value_cols, categorical_cols_raw):
    transformers = []
    for column in dataFrame.columns:
        if column in special_value_cols:
            # Two derived columns per special-value column: <column>_num and <column>_special_values
            special_expr = ExpressionTransformer(Expression("special_exp(X[0])", function_defs=[special_exp]))
            numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])", function_defs=[num_exp]))

            transformers.append(([column], [ContinuousDomain(dtype=float, data_min=-999999999, data_max=9999999999, missing_value_replacement=-999999999.0), numerical_expr], {"alias": f"{column}_num"}))
            transformers.append(([column], [ContinuousDomain(dtype=float, data_min=-999999999, data_max=9999999999, missing_value_replacement=-999999999.0), special_expr], {"alias": f"{column}_special_values"}))
        else:
            # All other columns are passed through unchanged
            passthrough = ExpressionTransformer(expr="X[0]")
            transformers.append(([column], passthrough))

    # Build the mapper once, after all transformers have been collected
    newcol_mapper = DataFrameMapper(transformers, df_out=True)
    return newcol_mapper, transformers
```

Specifically, the first step takes one column, `x`, and generates two new columns, `x_num` and `x_special_values`. The intended logic is given below, followed by a table of example values:

`x_num`: if `x >= 0` and `x != 999`, return `x`; otherwise return `999999999`. Missing values are left unchanged.

`x_special_values`: if `x < 0` or `x == 999`, return `x`; otherwise return `999999999`. Missing values are left unchanged.

x | x_num | x_special_values
-- | -- | --
1.0 | 1.0 | 999999999
2.0 | 2.0 | 999999999
-1.0 | 999999999 | -1.0
NaN | NaN | NaN
999 | 999999999 | 999
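
For reference, the intended rule can be written as plain Python without any sklearn2pmml machinery. This is only a sketch of the behaviour described above, not the pipeline code; the names `SENTINEL`, `is_missing`, `expected_num`, and `expected_special` are made up for illustration, and treating both `None` and `NaN` as missing is an assumption.

```python
import math

SENTINEL = 999999999.0  # value returned when the rule does not apply


def is_missing(x):
    """Assumption: both None and NaN count as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def expected_num(x):
    """x_num: keep x when x >= 0 and x != 999, else SENTINEL; missing stays missing."""
    if is_missing(x):
        return x
    return x if (x >= 0.0 and x != 999.0) else SENTINEL


def expected_special(x):
    """x_special_values: keep x when x < 0 or x == 999, else SENTINEL; missing stays missing."""
    if is_missing(x):
        return x
    return x if (x < 0.0 or x == 999.0) else SENTINEL


# Reproduce the rows of the example table
for x in [1.0, 2.0, -1.0, float("nan"), 999.0]:
    print(x, expected_num(x), expected_special(x))
```

Having the rule in this form gives a fixed target to compare both the Python pipeline output and the PMML output against.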


For the second part (the `else` branch), where columns are left unchanged, I wanted to stay with the `DataFrameMapper`, so as a workaround I used an `ExpressionTransformer` that returns the column values as they are.

I have also tried the following approach to generate the new columns, which does not work either.
```
special_expr = ExpressionTransformer(Expression("special_exp(X[0])", function_defs=[special_exp]), map_missing_to=-999999999.0)
numerical_expr = ExpressionTransformer(Expression("num_exp(X[0])", function_defs=[num_exp]), map_missing_to=-999999999.0)

transformers.append(([column], numerical_expr, {"alias": f"{column}_num"}))
transformers.append(([column], special_expr, {"alias": f"{column}_special_values"}))
```
I also tried using an inline expression string instead of a user-defined function, but that does not work either.

Here is one of the columns from the PMML file, in case it helps. I am not very familiar with PMML, so I am not sure whether it is working as intended.

```
<DerivedField name="x_num" optype="continuous" dataType="double">
	<Apply function="if">
		<Apply function="and">
			<Apply function="greaterOrEqual">
				<FieldRef field="x"/>
				<Constant dataType="double">0.0</Constant>
			</Apply>
			<Apply function="notEqual">
				<FieldRef field="x"/>
				<Constant dataType="double">999.0</Constant>
			</Apply>
		</Apply>
		<FieldRef field="x"/>
		<Apply function="if">
			<Apply function="equal">
				<FieldRef field="x"/>
				<Constant missing="true"/>
			</Apply>
			<Constant dataType="double">-9.99999999E8</Constant>
			<Constant dataType="double">9.99999999E8</Constant>
		</Apply>
	</Apply>
</DerivedField>
```
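
To make the `DerivedField` easier to compare with `num_exp`, here is a rough plain-Python reading of the XML next to the original function. It is a sketch under assumptions, not how an evaluator actually runs the expression: `pmml_x_num_reading` is a made-up name, `None` stands in for a missing value, and the comparison against `<Constant missing="true"/>` is read as a simple "is missing" test. How a real evaluator treats a missing `x` in the outer comparison is not something this sketch settles.

```python
def num_exp(x):
    # Same logic as the function used in the first pipeline step
    if (x >= 0.0 and x != 999.0) or x == -999999999.0 or x is None:
        return x
    else:
        return 999999999.0


def pmml_x_num_reading(x):
    """Literal reading of the DerivedField; None stands in for missing (assumption)."""
    # Outer <Apply function="if">: x >= 0 and x != 999 -> x
    if x is not None and x >= 0.0 and x != 999.0:
        return x
    # Inner <Apply function="if">: "x equals missing" -> -9.99999999E8
    if x is None:
        return -999999999.0
    # Otherwise -> 9.99999999E8
    return 999999999.0


# Compare the two on non-missing inputs, including the -999999999.0 sentinel
for x in [1.0, -1.0, 999.0, -999999999.0]:
    print(x, num_exp(x), pmml_x_num_reading(x))
```

Under this reading the two agree on ordinary values but not on `-999999999.0`, the value that `missing_value_replacement` substitutes for missing inputs. That makes the missing-value path a reasonable place to look first, though the reading itself remains an assumption.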

The rest of my pipeline has two more `DataFrameMapper` steps that use the columns generated in the first step, followed by the classifier that makes the predictions, as shown below.
```
initial_mapper, transformer1 = create_dataframe_mapper(df_train_pipeline, special_value_cols, categorical_cols_raw)
second_mapper, transformer2 = feature_eng(fteng_cols, all_cols)
final_mapper, transformer4 = create_final_mapper(binning_process, order_cols, all_num_cols2, all_cat_cols2, categorical_variables)

pipeline_mod = PMMLPipeline([
    ("mapper", initial_mapper),
    ("second_mapper", second_mapper),
    ("final_mapper", final_mapper),
    ("classifier", model)
])
```

My question is: where are things going wrong in the PMML? When I leave the first step out of the pipeline, the PMML and Python predictions match. I can also see that the output of the first step is correct in Python. Is there a known issue with how sklearn2pmml translates this transformation logic? Any guidance or fix would be greatly appreciated.
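
One way to narrow this down further is to line up values from both sides and list exactly which rows diverge, treating `None` and `NaN` as the same "missing" state. The helper below is a minimal sketch with a made-up name (`find_mismatches`) and made-up sample values; it works on plain lists, so it does not depend on how the values were obtained from either evaluator.

```python
import math


def find_mismatches(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, expected, actual) for every pair of values that differs."""
    mismatches = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        e_missing = e is None or math.isnan(e)
        a_missing = a is None or math.isnan(a)
        if e_missing or a_missing:
            # A mismatch only if exactly one side is missing
            if e_missing != a_missing:
                mismatches.append((i, e, a))
        elif not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, e, a))
    return mismatches


# Illustrative values only, not real pipeline output
python_values = [1.0, 2.0, 999999999.0, float("nan")]
pmml_values = [1.0, 2.0, 999999999.0, -999999999.0]
print(find_mismatches(python_values, pmml_values))
```

If every reported row turns out to be a missing or sentinel input, that would point at missing-value handling rather than at the comparison logic itself.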

Environment:
sklearn2pmml version:  0.112.1.post1
pypmml version: 1.5.6
Python version: 3.12.7
