Data quality "summary score" - possible enhancement #7

Closed
JosPolfliet opened this issue Apr 5, 2016 · 4 comments
Labels
feature request 💬 Requests for new features

Comments

@JosPolfliet
Contributor

Would anybody see merit in a "data quality summary score"? By that I mean a number between 0 and 100 that expresses the usability of each variable. A normally distributed variable with no missing values would score 100, and the score would drop depending on the problems in the data (outliers, skew, missing data, ...).

I have seen this a couple of times in commercial solutions. Any ideas?

@louden

louden commented Apr 19, 2016

It is a good idea, but no single score would fit all possible types of analyses. For example, for a non-parametric analysis I may not care about outliers, so I would not want them included in my score function. If you decide to implement it, I would suggest allowing the user to pass the score function as an option

pandas_profiling.ProfileReport(df, score = my_score_function)

with some number of built-in choices.
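To make the suggestion concrete, here is a minimal sketch of what a pluggable score could look like. It is only an illustration: the `score` option above is a proposal, not an existing `pandas_profiling` parameter, so the sketch works on a plain DataFrame instead. The names `default_score`, `nonparametric_score` and `score_columns`, and all the penalty weights, are hypothetical.

```python
import pandas as pd


def default_score(series: pd.Series) -> float:
    """Hypothetical built-in choice: start at 100 and subtract penalties
    for missing values, skew and IQR-based outliers (weights are arbitrary)."""
    values = series.dropna()
    if values.empty:
        return 0.0
    missing_penalty = 50.0 * series.isna().mean()
    skew = values.skew()  # NaN when fewer than 3 observations
    skew_penalty = 0.0 if pd.isna(skew) else min(25.0, 10.0 * abs(skew))
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    outlier_penalty = 25.0 * is_outlier.mean()
    return max(0.0, 100.0 - missing_penalty - skew_penalty - outlier_penalty)


def nonparametric_score(series: pd.Series) -> float:
    """Hypothetical alternative that ignores outliers and skew entirely
    and only looks at the share of missing values."""
    return 100.0 * (1.0 - series.isna().mean())


def score_columns(df: pd.DataFrame, score=default_score) -> dict:
    """Apply a user-supplied score function to every numeric column."""
    numeric = df.select_dtypes(include="number")
    return {name: float(score(numeric[name])) for name in numeric.columns}
```

A user doing a non-parametric analysis would then call `score_columns(df, score=nonparametric_score)`, while everyone else gets the default.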

@dartdog

dartdog commented Aug 25, 2016

Rather than open another issue, I'd like to suggest (or ask about) an idea I'm working on. My skills are not so hot, so I'm hoping someone better than I am could pick it up and run with it:

I love Profiling and it is now my go-to for any new dataset.

What I would really love is the ability to have it automatically compare two or three target variables against the dependent ones (all the others). For instance, given a file with males and females, we would want to compare the age frequency (it needs to be a % of the selected subgroup), and similarly if we broke down the males and females by the state they live in: count, normalize, plot, and so on. It is a bit tricky for categorical variables, where we need to count and normalize them. We might as well also compute the covariance and rank the variables by covariance. This would be hugely valuable for a first look when beginning any machine learning work. Hopefully that makes sense. I'm slowly tinkering with it for a specific file, so maybe I can then learn to generalize it. Maybe someone else is way faster and better than I am and also wants the same!
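The "count and normalize" step can be sketched with plain pandas, outside of the profiling report. This is only a rough illustration, assuming a DataFrame with one grouping column and one column to compare; the function name `compare_by_group` and the column names in the usage note are made up for the example.

```python
import pandas as pd


def compare_by_group(df: pd.DataFrame, group_col: str, value_col: str, bins=None) -> pd.DataFrame:
    """Frequency of value_col within each subgroup of group_col.

    Each column of the result is one subgroup and sums to 1, so the
    subgroups are comparable even when their sizes differ.
    Numeric variables can be bucketed first by passing bin edges.
    """
    values = df[value_col]
    if bins is not None:
        values = pd.cut(values, bins=bins)  # turn a numeric column into intervals
    # normalize="columns" divides each count by its subgroup total
    return pd.crosstab(values, df[group_col], normalize="columns")
```

For example, `compare_by_group(df, "sex", "state")` would give the share of each state among males and among females, and `compare_by_group(df, "sex", "age", bins=[0, 18, 35, 65, 120])` would do the same for age bands. Calling `.plot.bar()` on the result gives a quick side-by-side chart.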

@JosPolfliet
Contributor Author

Doing target profiling is definitely high on the priority list, see #10

Once that is implemented, it would be easier to add a list of target variables instead of just one.

@sbrugman added the feature request 💬 Requests for new features label May 29, 2019
@github-actions

Stale issue


4 participants