
Dedicated attribute/method access to list of fields of a DataFrameModel #1286


Open
nathanjmcdougall opened this issue Aug 4, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

nathanjmcdougall commented Aug 4, 2023

Is your feature request related to a problem? Please describe.
It is often helpful to have a list of all fields/columns associated with a DataFrameModel.

For example, to easily set a consistent order of the columns of a DataFrame (especially if we want to validate this with ordered = True):

from pandera import DataFrameModel
from pandera.typing import Series

class MyDFModel(DataFrameModel):
    abc: Series[int]
    xyz: Series[int]

ALL_COLUMNS = [MyDFModel.abc, MyDFModel.xyz]

df = df[ALL_COLUMNS].copy()

Other cases that are coming up for me are using multiple columns to aggregate results or perform joins:

df.groupby(GROUP_COLS).sum()
df1.merge(df2, on=SHARED_COLS)

A final case is for column checks:

if "my_col" in MyDFModel:
    ...

It would be nice not to have to enumerate all the fields of MyDFModel manually. Currently, the most convenient way to obtain this information automatically seems to be:

ALL_COLUMNS = list(MyDFModel.to_schema().columns)

This is quite verbose.

Describe the solution you'd like
I propose allowing iteration over the DataFrameModel class directly, yielding each column name, e.g. list(MyDFModel) or for col in MyDFModel.

Currently, DataFrameModel subclasses are not iterable.

Some intuition behind this: df.abc returns the column itself (df["abc"]), and MyDFModel.abc returns the column name "abc". By analogy, just as the unmodified DataFrame df stands for all columns, we might expect the unmodified MyDFModel to stand for an iterable of all column names.
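To make the proposal concrete, here is a minimal sketch of how class-level iteration could work using a metaclass. It does not use pandera at all: FieldIterMeta, ToyModel and the subclasses are hypothetical stand-ins, and the fields are read from plain type annotations rather than from a real schema, so this only illustrates the behaviour being requested, not how pandera would implement it.

```python
from typing import Iterator, get_type_hints


class FieldIterMeta(type):
    """Hypothetical metaclass: iterating over a class yields its field names."""

    def __iter__(cls) -> Iterator[str]:
        # get_type_hints walks the MRO, so inherited fields are included,
        # with base-class fields listed first.
        return iter(get_type_hints(cls))


class ToyModel(metaclass=FieldIterMeta):
    """Stand-in for DataFrameModel; not pandera code."""


class MyToyModel(ToyModel):
    abc: int
    xyz: int


class ExtendedToyModel(MyToyModel):
    extra: float


all_columns = list(MyToyModel)             # ['abc', 'xyz']
has_abc = "abc" in MyToyModel              # True
has_my_col = "my_col" in MyToyModel        # False
extended_columns = list(ExtendedToyModel)  # ['abc', 'xyz', 'extra']
```

Because Python's in operator falls back to __iter__ when no __contains__ is defined, the same hook would also cover the column-check use case.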

Describe alternatives you've considered
Another possibility is a function e.g. maybe pa.columns:

def columns(dfm: type[DataFrameModel]) -> list[str]:
    return list(dfm.to_schema().columns)

I think naming this function would need some thought (I don't like the idea of having a function called columns in the namespace, because columns is likely to be used as a variable name often enough). Usage would be fairly clear, though:

df = df[columns(MyDFModel)].copy()

Another similar possibility is a DataFrameModel.columns property which exposes a list of the fields associated with the DataFrameModel. I think this would be a good solution too. Since pd.DataFrame already has a columns attribute, it should not cause particularly problematic name collisions.

A workaround hinted at by jeffzi in #364 (comment) is to run MyDFModel.to_schema() to set the cache (immediately after the definition of MyDFModel?) and then use list(MyDFModel.__fields__). But this is still fairly verbose, and it feels fragile to rely on remembering to set the cache like this.

Similarly, in some cases a workaround is to access list(MyDFModel.__annotations__). This is still fairly verbose, and moreover it does not include any fields inherited from parent classes if the DataFrameModel in question is a subclass of another one.
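The inheritance limitation of __annotations__ can be seen with plain Python classes. This is only a sketch: ParentFields and ChildFields are hypothetical stand-ins for a DataFrameModel hierarchy, and typing.get_type_hints is shown as one standard-library way to merge annotations across the MRO. It mirrors annotation names only and ignores anything pandera derives when it builds the schema.

```python
from typing import get_type_hints


class ParentFields:
    abc: int


class ChildFields(ParentFields):
    xyz: int


# __annotations__ on the subclass holds only the fields declared on it.
own_only = list(ChildFields.__annotations__)        # ['xyz']

# get_type_hints merges annotations from every class in the MRO,
# base classes first.
with_inherited = list(get_type_hints(ChildFields))  # ['abc', 'xyz']
```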
