Create empty dataframe from schema #992
Comments
You can do |
Thanks Niels for taking the time to respond. I have tried
for a column defined as such: created_at: Optional[Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]] # type: ignore |
ah, I don't think pandera strategies support do you mind opening up a feature request? Here's a recipe for creating an empty dataframe without the data synthesis strategies:
from typing import Annotated
import pandas as pd
import pandera as pa
from pandera.typing import Series
class Schema(pa.SchemaModel):
    col1: Series[int]
    col2: Series[float]
    col3: Series[str]
    col4: Series[pd.Timestamp]
    col5: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]]
dtypes = {k: str(v) for k, v in Schema.to_schema().dtypes.items()}
empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
print(empty_df)
print(empty_df.dtypes)
# Output:
# Empty DataFrame
# Columns: [col1, col2, col3, col4, col5]
# Index: []
# col1 int64
# col2 float64
# col3 object
# col4 datetime64[ns]
# col5 datetime64[ns, UTC]
# dtype: object
Does this work for you? |
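For readers without pandera at hand, the same idea can be sketched with pandas alone. This is only an illustration: the helper name `empty_frame` and the hard-coded dtype mapping are assumptions, standing in for what `Schema.to_schema().dtypes` yields in the recipe above. Instead of calling `astype` on an untyped frame, it builds each column as an empty, typed Series:

```python
import pandas as pd


def empty_frame(dtypes: dict) -> pd.DataFrame:
    """Build a zero-row DataFrame whose columns carry the given dtypes."""
    # One empty, typed Series per column; dict order becomes column order.
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in dtypes.items()}
    )


# Assumed mapping, mirroring the dtypes printed in the recipe's output.
dtypes = {
    "col1": "int64",
    "col2": "float64",
    "col3": "object",
    "col4": "datetime64[ns]",
    "col5": pd.DatetimeTZDtype("ns", "UTC"),  # timezone-aware timestamps
}
empty_df = empty_frame(dtypes)
print(empty_df.dtypes)
```

Passing a `pd.DatetimeTZDtype` instance rather than a string keeps the timezone explicit, which is the case the thread was struggling with.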
Thanks for the suggestion! Yes, this does work. I was able to combine it with a trick I learned from PEP 673 to come up with the following solution, which works for Python 3.10 and plays nicely with mypy:
SchemaType = TypeVar("SchemaType", bound="MySchemaModel")
class MySchemaModel(pa.SchemaModel):
    """
    Provides a `pandera.SchemaModel` with convenience function for generating empty
    dataframes that fit the schema.
    """
    @classmethod
    def empty(cls) -> DataFrame[SchemaType]:
        dtypes = {k: str(v) for k, v in cls.to_schema().dtypes.items()}
        empty_df = pd.DataFrame(columns=[*dtypes]).astype(dtypes)
        return DataFrame[SchemaType](empty_df)
Is this something that we could add to Pandera itself? I'm sure a |
Great! Yes would welcome a PR on this. One note on the approach: we should add an |
Hi, I like the feature and got impatient, so I started working on a PR. A few questions, @cosmicBboy: while writing tests I ran into some issues. These types can't be instantiated as a dtype during the
And these two failed a subsequent
I'll look into them a bit more, but I'm hoping you could tell me right away what the issue might be. edit: I built my test-schema like this:
schema = pandera.DataFrameSchema(columns={
    "pandera.dtypes.DataType": pandera.Column(pandera.dtypes.DataType),
    "pandera.dtypes._Number": pandera.Column(pandera.dtypes._Number),
    "pandera.dtypes._PhysicalNumber": pandera.Column(pandera.dtypes._PhysicalNumber),
    "pandera.dtypes.Int": pandera.Column(pandera.dtypes.Int),
    "pandera.dtypes.Int64": pandera.Column(pandera.dtypes.Int64),
    # ... etc
})
I hope that's how I'm supposed to do it. |
Thanks @a-recknagel! Not sure what implementation you opted for, but revisiting the code snippet I posted above, I think a more robust approach would be:
schema = Schema.to_schema()
schema.coerce = True
empty_df = schema.coerce_dtype(pd.DataFrame(columns=[*schema.columns]))
This piggy-backs on pandera's coercion logic, and you should be able to use the pandera
I'd ignore the abstract
|
I created a PR from my in-progress branch. Changing the way the empty dataframe is created to leverage coercion didn't seem to change the failing cases.
You mean within the |
Signed-off-by: Arne Recknagel <[email protected]>
I know this is an old thread, but I came across it and it mostly worked for me, except that it wasn't preserving an index field. I added a line
|
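The line the commenter added is not shown in this thread. As a separate, pandas-only illustration of the index problem (the helper name, the column names, and the `index_name` parameter are all hypothetical and not pandera API), an empty frame can keep a named, typed index if the index field is first created as a typed column and then promoted with `set_index`:

```python
import pandas as pd


def empty_frame_with_index(dtypes: dict, index_name: str) -> pd.DataFrame:
    """Zero-row DataFrame with typed columns and a named, typed index."""
    frame = pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in dtypes.items()}
    )
    # Promote the chosen column to the index; its name and dtype carry over.
    return frame.set_index(index_name)


# Hypothetical schema: "id" is the index field, "value" a regular column.
empty_df = empty_frame_with_index(
    {"id": "int64", "value": "float64"}, index_name="id"
)
print(empty_df.index.name, empty_df.index.dtype)
```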
@ssuffian My time is bound up elsewhere so I can't review this right now. If you want, you can cherry pick the changes from my branch and take over though. |
Question about pandera
I need to be able to create an empty dataframe and (maybe) populate it later. I hoped this would be fairly straightforward, like np.empty(...), but so far the best way I have found is to write an empty() method for each SchemaModel I have that explicitly creates a pd.DataFrame with manually-maintained columns and dtypes. Have I overlooked something?