Introduced dtype_enum to hold additional type metadata #18494

Open
wants to merge 19 commits into
base: branch-25.06

galipremsagar
Contributor

@galipremsagar galipremsagar commented Apr 14, 2025

Description

This PR introduces dtype_enum, a type enum that represents the true pandas dtype backend. Pandas currently supports 3 dtype backends:

  1. pd.core.dtypes.dtypes.NumpyEADtype
  2. pd.core.dtypes.dtypes.ArrowDtype
  3. pd.core.dtypes.dtypes.ExtensionDtype

If a pandas series with any of the above dtypes is passed to cudf, dtype_enum is set accordingly, and during the to_pandas conversion the original dtype is restored using dtype_enum.

dtype_enum is only functional in pandas-compatibility mode.

Fixes: #14149

I plan on opening separate PRs to fix the failures newly unlocked in the pandas test suite after this PR is merged.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Apr 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@galipremsagar galipremsagar added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Apr 14, 2025
@github-actions github-actions bot added the Python (Affects Python cuDF API) label Apr 14, 2025
@galipremsagar
Contributor Author

/okay to test 5f811ba

@galipremsagar
Contributor Author

/okay to test e18b431

@galipremsagar galipremsagar added the 3 - Ready for Review (Ready for review by team) label Apr 16, 2025
@galipremsagar galipremsagar marked this pull request as ready for review April 16, 2025 18:50
@galipremsagar galipremsagar requested a review from a team as a code owner April 16, 2025 18:50
@galipremsagar galipremsagar requested review from bdice and Matt711 April 16, 2025 18:50
@galipremsagar galipremsagar changed the title from "dtype_enum wip" to "Introduced dtype_enum to hold additional type metadata" Apr 16, 2025
@@ -783,9 +785,9 @@ def to_pandas(
nullable: bool = False,
arrow_type: bool = False,
) -> pd.Index:
if nullable:
if arrow_type or self.dtype_enum in {2}:
Contributor

We would only want self.dtype_enum to influence to_pandas in pandas-compatible mode, correct? Otherwise, a non-cudf.pandas user who calls e.g. to_pandas(arrow_type=False) may still get an Arrow dtype pandas object back.
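
A minimal sketch of the gating described in this comment, assuming the check is factored into a standalone helper. The name `use_arrow_backend` and the `pandas_compatible` parameter are hypothetical and not part of the PR; in cuDF the flag would presumably be read from the `mode.pandas_compatible` option. The constant 2 is the value the PR's `DTYPE_ENUM_MAP` assigns to `ArrowDtype`.

```python
# Value assigned to pd.core.dtypes.dtypes.ArrowDtype in the PR's DTYPE_ENUM_MAP.
ARROW_ENUM = 2


def use_arrow_backend(
    arrow_type: bool, dtype_enum: int, pandas_compatible: bool
) -> bool:
    """Hypothetical helper: decide whether to_pandas should return Arrow dtypes."""
    # An explicit arrow_type=True request is always honored.
    if arrow_type:
        return True
    # The stored dtype_enum only takes effect in pandas-compatible mode,
    # so to_pandas(arrow_type=False) keeps its meaning for regular cuDF users.
    return pandas_compatible and dtype_enum == ARROW_ENUM
```

With this shape, `use_arrow_backend(False, 2, False)` is false, which is the case the comment is concerned about.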

}


def get_dtype_enum(dtype: Dtype) -> int:
Contributor

It would be nice to use an enum.Enum to represent this, so that at least in the code we could check e.g. PandasTypeEnum.ARROW or something.
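
One way this suggestion could look, as a sketch only. `PandasTypeEnum` is the name floated in the comment; the member names and the `wants_arrow` helper are assumptions, and the integer values mirror the ones in the PR's `DTYPE_ENUM_MAP`. `enum.IntEnum` is used here instead of plain `enum.Enum` so that existing comparisons against bare integers would keep working.

```python
import enum


class PandasTypeEnum(enum.IntEnum):
    """Hypothetical named replacement for the bare integers 1, 2, 3."""

    NUMPY = 1      # PANDAS_NUMPY_DTYPE in the PR's DTYPE_ENUM_MAP
    ARROW = 2      # pd.core.dtypes.dtypes.ArrowDtype
    EXTENSION = 3  # pd.core.dtypes.dtypes.ExtensionDtype


def wants_arrow(arrow_type: bool, dtype_enum: int) -> bool:
    """Readable equivalent of `arrow_type or self.dtype_enum in {2}`."""
    return arrow_type or dtype_enum == PandasTypeEnum.ARROW
```

Because `IntEnum` members compare equal to their integer values, code that still stores a plain `2` would behave the same during a gradual migration.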

DTYPE_ENUM_MAP = {
PANDAS_NUMPY_DTYPE: 1,
pd.core.dtypes.dtypes.ArrowDtype: 2,
pd.core.dtypes.dtypes.ExtensionDtype: 3,
Contributor

IMO we shouldn't be supporting any arbitrary ExtensionDtype at this point - this would include 3rd-party custom ExtensionDtypes.

@mroeschke
Contributor

I know Ashwin (and partially myself) had some reservations about introducing another attribute that we would need to keep in sync. From the review:

Instead of introducing a new argument here, can we use the existing dtype argument and inspect whether it is a numpy, Arrow, or extension dtype?

Did you happen to explore if using dtype was feasible here?

Labels
3 - Ready for Review, improvement, non-breaking, Python
Development

Successfully merging this pull request may close these issues.

[FEA] Ability to round-trip all pandas columns dtypes
2 participants