Data quality "summary score" - possible enhancement #7
Comments
It is a good idea, but no single score would fit all possible types of analysis. For example, for a non-parametric analysis, I may not care about outliers, so I would not want them included in my score function. If you decide to implement it, I would suggest allowing the user to pass the score function as an option, with some number of built-in choices.
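A minimal sketch of what a pluggable score could look like, to make the suggestion concrete. Everything here is hypothetical and not part of the library: the names `quality_scores`, `BUILTIN_SCORES`, `missing_score` and `missing_and_outlier_score`, the use of `None` for missing values, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration. It is written in plain Python so that it assumes nothing about the library's internals.

```python
import statistics
from typing import Callable, Dict, Optional, Sequence, Union

# A column is a sequence of numbers, with None marking a missing value (assumption).
Column = Sequence[Optional[float]]
ScoreFn = Callable[[Column], float]


def missing_score(values: Column) -> float:
    """Return 100 when nothing is missing, 0 when everything is missing."""
    if not values:
        return 0.0
    present = sum(v is not None for v in values)
    return 100.0 * present / len(values)


def missing_and_outlier_score(values: Column, k: float = 1.5) -> float:
    """Missing-value score, scaled down by the share of IQR outliers."""
    base = missing_score(values)
    present = [v for v in values if v is not None]
    if len(present) < 4:
        # Too few points for meaningful quartiles; skip the outlier penalty.
        return base
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    outliers = sum(v < q1 - k * iqr or v > q3 + k * iqr for v in present)
    return base * (1.0 - outliers / len(present))


# Hypothetical registry of built-in choices, selectable by name.
BUILTIN_SCORES: Dict[str, ScoreFn] = {
    "missing": missing_score,
    "missing_outliers": missing_and_outlier_score,
}


def quality_scores(
    columns: Dict[str, Column],
    score: Union[str, ScoreFn] = "missing",
) -> Dict[str, float]:
    """Score every column with a built-in (by name) or a user-supplied function."""
    fn = BUILTIN_SCORES[score] if isinstance(score, str) else score
    return {name: float(fn(values)) for name, values in columns.items()}
```

Someone running a non-parametric analysis could then call `quality_scores(data, score="missing")` to ignore outliers entirely, or pass their own callable in place of the name.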
Rather than open another issue, I'd like to suggest an idea I'm working on (my skills are not so hot, so I'm hoping someone better than I am could pick it up and run with it). I love Profiling, and it is now my go-to for any new dataset. What I would really love is the ability to have it automatically compare two or three target variables against the dependent ones (all the others). For instance, given a file with males and females, we would want to compare the age frequency, expressed as a percentage of the selected subgroup. Similarly, if we broke down the males and females by the state they live in, we would count, normalize, and plot. It is a bit tricky for categorical variables, where we need to count and normalize them. It might as well also compute the covariance and rank the variables by covariance. This would be hugely valuable for a first look when beginning any machine learning work. Hopefully that makes sense. I'm slowly tinkering with it for a specific file, so maybe I can then learn to generalize it. Maybe someone else is way faster and better than I am and also wants the same!
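The count-and-normalize step for a categorical variable can be sketched as follows. This is only an illustration under assumptions: the function name `normalized_counts`, the row-as-dict layout, and the `sex` and `state` column names in the usage example are made up, and nothing here reflects how the library would implement target profiling.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def normalized_counts(
    rows: Iterable[Mapping[str, Hashable]],
    group_key: str,
    value_key: str,
) -> Dict[Hashable, Dict[Hashable, float]]:
    """For each group, return the share of rows that take each value.

    Shares are fractions of the group's own size, so groups of
    different sizes can be compared side by side.
    """
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for row in rows:
        counts[row[group_key]][row[value_key]] += 1

    result: Dict[Hashable, Dict[Hashable, float]] = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        result[group] = {value: n / total for value, n in counter.items()}
    return result


# Hypothetical example data: compare where each sex lives.
rows = [
    {"sex": "M", "state": "NY"},
    {"sex": "M", "state": "CA"},
    {"sex": "F", "state": "NY"},
    {"sex": "F", "state": "NY"},
]
shares = normalized_counts(rows, group_key="sex", value_key="state")
```

With the data already in a pandas DataFrame `df`, `pd.crosstab(df["sex"], df["state"], normalize="index")` produces the same row-normalized table, ready for plotting.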
Target profiling is definitely high on the priority list; see #10. Once that is implemented, it would be easier to accept a list of target variables instead of just one.
Stale issue
Would anybody see merit in a "data quality summary score"? By that I mean a number between 0 and 100 describing the usability of each variable. A normally distributed variable with no missing values would score 100, and the score would drop depending on the problems in the data (outliers, skew, missing data, ...).
I have seen this a couple of times in commercial solutions. Any ideas?