Pandas nullable String dtype is not recognized as a Pandera String #1054

gwerbin-tive · 2022-12-12T23:30:41Z

gwerbin-tive
Dec 12, 2022

Describe the bug

A column containing the 'string' dtype (i.e. pandas.StringDtype) is not valid for the Pandera semantic string data type, pandera.dtypes.String.

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandera.
(optional) I have confirmed this bug exists on the master branch of pandera.

I attempted to search for this issue, but might have missed it. Sorry if I did!

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {'color': pa.Column(pa.dtypes.String)}
)

data = pd.Series(['red', 'green', 'blue'], dtype='string').to_frame('color')

schema.validate(data)

Expected behavior

I expected this schema to pass validation successfully.

Desktop (please complete the following information):

MacOS 12.4 (ARM64)
Pandera v0.13.4

Answered by cosmicBboy

Dec 13, 2022


cosmicBboy · 2022-12-13T15:54:21Z

cosmicBboy
Dec 13, 2022
Maintainer

hi @gwerbin-tive, you'll need to use pandera.STRING here, since you want to use the pandera-native string type. See here for all the dtype aliases defined by pandera (pandera.String is the numpy string type).

In general, the recommended way of doing this is to use pd.StringDtype() directly or the string alias "string". If you want to use the pandera data type, use pandera.STRING, which is just an alias of pd.StringDtype().

Need to work on better datatype docs!

0 replies

gwerbin-tive · 2022-12-14T02:18:40Z

gwerbin-tive
Dec 14, 2022
Author

Thank you for clarifying @cosmicBboy.

I was hoping to support either one of pandas.StringDtype or "object"-containing-str as valid inputs. Does the "string" alias support that?

Also, I am somewhat surprised that pandera.dtypes.String does not (is not intended to?) support pd.StringDtype. I assume there's a good technical reason for that.

0 replies

cosmicBboy · 2022-12-14T15:24:48Z

cosmicBboy
Dec 14, 2022
Maintainer

So pandera.dtypes.String is an alias for the str numpy type (i.e. object). pandera.engines.pandas_engine.STRING is the pandas-native string type.

This is simply by definition in the pandera API.

Similarly, pandera.dtypes.Int maps to the numpy integer type, while pandera.engines.pandas_engine.INT64 maps to the pandas-native nullable integer type.

0 replies

cosmicBboy · 2022-12-14T15:27:24Z

cosmicBboy
Dec 14, 2022
Maintainer

I was hoping to support either one of pandas.StringDtype or "object"-containing-str as valid inputs. Does the "string" alias support that?

In pandera one needs to be precise about the types. Pandera schemas follow whatever conventions are set by the underlying framework (pandas in this case). So the "string" alias is literally just an alias for pd.StringDtype(), and "str" or str is just an alias for the numpy string type.

You'll have to create your own custom data type if you want to define a string type that can either be a pandas-native StringDtype OR a numpy str (object) dtype.

1 reply

gwerbin-tive Dec 22, 2022
Author

Thanks for the clarification, but I was surprised that the Pandera "semantic" types are intended to be 1:1 mappings with underlying physical data types.

Based on my reading of the documentation and of their designation as "semantic", I thought they would have the exact opposite property: to be abstract semantic types, without correspondence to any physical storage type. So the Pandera string type would represent all valid string data, the integer type would represent all valid integer data, etc.

cosmicBboy · 2022-12-14T15:27:59Z

cosmicBboy
Dec 14, 2022
Maintainer

Converting this issue to a discussion. @gwerbin-tive would you mind marking the appropriate response as the answer?

0 replies
