
Create empty dataframe from schema #992

Open
Davidkloving opened this issue Oct 31, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@Davidkloving

Question about pandera

I need to be able to create an empty dataframe and (maybe) populate it later. I hoped this would be fairly straightforward, like np.empty(...), but so far the best way I have found is to write an empty() method for each SchemaModel I have that explicitly creates a pd.DataFrame with manually maintained columns and dtypes. Have I overlooked something?
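
For readers unfamiliar with the workaround described above, here is a minimal sketch of what a hand-maintained version might look like. The `EVENT_DTYPES` mapping, its column names, and the `empty_events` function are hypothetical and not taken from the thread; the point is only that the dtype table has to be kept in sync with the schema by hand.

```python
import pandas as pd

# Hypothetical, hand-maintained copy of a schema's columns and dtypes.
# Nothing ties this mapping to the schema, so the two can drift apart.
EVENT_DTYPES = {
    "user_id": "int64",
    "score": "float64",
    "created_at": "datetime64[ns, UTC]",
}


def empty_events() -> pd.DataFrame:
    """Return a zero-row dataframe whose columns carry the dtypes above."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in EVENT_DTYPES.items()}
    )


df = empty_events()
print(df.dtypes)
```

Building each column as an empty `pd.Series` with an explicit dtype avoids the object-dtype columns that `pd.DataFrame(columns=[...])` produces on its own.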

@Davidkloving Davidkloving added the question Further information is requested label Oct 31, 2022
@cosmicBboy
Collaborator

You can do SchemaModel.example(size=0) to create an empty dataframe, via data synthesis strategies

@Davidkloving
Author

Davidkloving commented Nov 1, 2022

Thanks, Niels, for taking the time to respond.

I have tried .example(size=0) but I was hoping to accomplish this without introducing hypothesis as a dependency. For some reason it makes our tests very slow to start, sometimes hang, and surprisingly produces the following error:

pandera.errors.SchemaError: expected series 'created_at' to have type datetime64[ns, UTC], got datetime64[ns]

for a column defined as such:

created_at: Optional[Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]]  # type: ignore
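
The two dtypes named in that error message can be reproduced with pandas alone. The sketch below is not a fix for the strategy behaviour and makes no claim about how pandera generates the column; it only shows that a timezone-naive and a timezone-aware datetime column have different dtypes even with zero rows, and that localizing the naive one yields the aware dtype.

```python
import pandas as pd

# Zero-row columns: one timezone-naive, one timezone-aware.
naive = pd.Series([], dtype="datetime64[ns]")
aware = pd.Series([], dtype="datetime64[ns, UTC]")

print(naive.dtype)
print(aware.dtype)

# Only the tz-aware column uses pandas' DatetimeTZDtype extension dtype.
print(isinstance(naive.dtype, pd.DatetimeTZDtype))
print(isinstance(aware.dtype, pd.DatetimeTZDtype))

# Localizing the naive column produces the tz-aware dtype, even when empty.
localized = naive.dt.tz_localize("UTC")
print(localized.dtype)
```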

@cosmicBboy
Collaborator

ah, I don't think pandera strategies support pd.DatetimeTZDtype yet.

do you mind opening up a feature request?

Here's a recipe for creating an empty dataframe without the data synthesis strategies:

from typing import Annotated
import pandas as pd
import pandera as pa
from pandera.typing import Series


class Schema(pa.SchemaModel):
    col1: Series[int]
    col2: Series[float]
    col3: Series[str]
    col4: Series[pd.Timestamp]
    col5: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]


dtypes = {k: str(v) for k, v in Schema.to_schema().dtypes.items()}
empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)

print(empty_df)
print(empty_df.dtypes)

# Output:
# Empty DataFrame
# Columns: [col1, col2, col3, col4, col5]
# Index: []
# col1                  int64
# col2                float64
# col3                 object
# col4         datetime64[ns]
# col5    datetime64[ns, UTC]
# dtype: object

Does this work for you?

@Davidkloving
Author

Thanks for the suggestion! Yes, this does work. I was able to combine it with a trick I learned from PEP 673 to come up with the following solution which works for Python 3.10 and plays nicely with mypy:

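from typing import TypeVar

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

# Bound TypeVar so that subclasses get their own type back from empty().
SchemaType = TypeVar("SchemaType", bound="MySchemaModel")


class MySchemaModel(pa.SchemaModel):
    """
    Provides a `pandera.SchemaModel` with a convenience method for generating empty
    dataframes that fit the schema.
    """

    @classmethod
    def empty(cls) -> DataFrame[SchemaType]:
        # Map each column name to its dtype string, then cast an empty frame.
        dtypes = {k: str(v) for k, v in cls.to_schema().dtypes.items()}
        empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
        return DataFrame[SchemaType](empty_df)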

Is this something that we could add to Pandera itself? I'm sure a .empty() would be useful to many people.

@cosmicBboy
Collaborator

Great! Yes, I would welcome a PR on this. One note on the approach: we should add an empty method to DataFrameSchema that implements essentially the first two lines of your empty method. SchemaModel.empty would then use it to create the typed DataFrame[SchemaType] dataframe. Converting this issue to an enhancement.

@cosmicBboy cosmicBboy added enhancement New feature or request and removed question Further information is requested labels Nov 7, 2022
@a-recknagel
Contributor

a-recknagel commented Nov 29, 2022

Hi, I like the feature and got impatient, so I started working on a PR. A few questions, @cosmicBboy: while writing tests I ran into some issues.

These types can't be instantiated as a dtype during the astype call, at least some of them because they are too abstract. Can all of these be safely dropped from the test?

pandera.dtypes.DataType
pandera.dtypes._Number
pandera.dtypes._PhysicalNumber
pandera.engines.numpy_engine.DataType
pandera.engines.pandas_engine.DataType
pandera.engines.pandas_engine.Period
pandera.engines.pandas_engine.Interval
pandera.engines.pandas_engine.PydanticModel

And these two failed a subsequent validate call by the schema that defined the dtype:

  • pandera.engines.numpy_engine.DateTime64 -- Expected type datetime64, got datetime64[ns]
  • pandera.engines.numpy_engine.Bytes -- Data type 'bytes8' not understood

I'll look into them a bit more, but I'm hoping you could tell me right away what the issue might be.


edit: I built my test-schema like this:

schema = pandera.DataFrameSchema(columns={
    "pandera.dtypes.DataType": pandera.Column(pandera.dtypes.DataType),
    "pandera.dtypes._Number": pandera.Column(pandera.dtypes._Number),
    "pandera.dtypes._PhysicalNumber": pandera.Column(pandera.dtypes._PhysicalNumber),
    "pandera.dtypes.Int": pandera.Column(pandera.dtypes.Int),
    "pandera.dtypes.Int64": pandera.Column(pandera.dtypes.Int64),
    ... etc
})

I hope that's how I'm supposed to do it.
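
One possible reading of the DateTime64 failure above, offered as an assumption rather than a diagnosis confirmed in the thread: NumPy distinguishes a unit-less `datetime64` dtype from `datetime64[ns]`, and pandas columns carry an explicit unit. The sketch below uses only NumPy and pandas and says nothing about how pandera compares the two.

```python
import numpy as np
import pandas as pd

generic = np.dtype("datetime64")        # no time unit attached
with_unit = np.dtype("datetime64[ns]")  # nanosecond unit

# A zero-row pandas column created with the nanosecond dtype.
empty = pd.Series([], dtype="datetime64[ns]")

# The unit-less dtype is not equal to the nanosecond one...
print(generic == with_unit)
# ...while the pandas column matches the nanosecond dtype exactly.
print(empty.dtype == with_unit)
```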

@cosmicBboy
Collaborator

Thanks @a-recknagel !

Not sure what implementation you opted for, but revisiting the code snippet I posted above, I think a more robust approach would be:

schema = Schema.to_schema()
schema.coerce = True
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))

This piggy-backs on pandera's coercion logic, and you should be able to use the pandera DataType subclasses in your test.

> But can all of these be safely dropped from the test?

I'd ignore the abstract DataTypes (basically, test the dtypes supported by pandas_engine). Pandas-supported types like Period and Interval should be included.

PydanticModel is an interesting case, I'm not sure, but I don't think it'll work given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with the PydanticModel for now.

@a-recknagel
Contributor

I created a PR from my in-progress branch. Changing the way the empty dataframe is created to leverage coercion didn't seem to change the failing cases.

> PydanticModel is an interesting case, I'm not sure, but I don't think it'll work given that it needs rows to make coercion work. I'd recommend special-casing a TypeError when using empty() with the PydanticModel for now.

You mean within the empty function, right? I'll try tomorrow.

a-recknagel pushed a commit to a-recknagel/pandera that referenced this issue Nov 30, 2022
Signed-off-by: Arne Recknagel
a-recknagel pushed a commit to a-recknagel/pandera that referenced this issue Jan 30, 2023
@ssuffian

I know this is an old thread, but I came across it and it mostly worked for me except it wasn't preserving an index field. I added a line index=pd.Index([],name=schema.index.name, dtype=schema.index.dtype.type) to get it to work:

from datetime import datetime

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series


class TestDf(pa.DataFrameModel):
    dt: Index[datetime] = pa.Field(check_name=True)
    col1: Series[int]
    col2: Series[int]
    col3: Series[int]

    @classmethod
    def empty(cls):
        schema = cls.to_schema()
        schema.coerce = True
        # Build an empty index that keeps the schema's index name and dtype.
        index = pd.Index([], name=schema.index.name, dtype=schema.index.dtype.type)
        empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns], index=index))
        return empty_df
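
The index handling can also be seen in isolation with pandas only. In this sketch the `column_dtypes` mapping and the literal `"dt"` index name are hypothetical stand-ins for what the model above would supply; it shows that passing a named, typed empty index to the constructor is what preserves the index name and dtype through `astype`.

```python
import pandas as pd

# Hypothetical column dtypes, chosen to mirror the model above.
column_dtypes = {"col1": "int64", "col2": "int64", "col3": "int64"}

# An empty, named datetime index. Without it the frame falls back to an
# unnamed default index and the index field of the schema is lost.
index = pd.Index([], name="dt", dtype="datetime64[ns]")

empty_df = pd.DataFrame(columns=[*column_dtypes], index=index).astype(column_dtypes)

print(empty_df.index)
print(empty_df.dtypes)
```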

@a-recknagel
Contributor

@ssuffian My time is bound up elsewhere so I can't review this right now. If you want, you can cherry pick the changes from my branch and take over though.
