Data quality "summary score" - possible enhancement #7

JosPolfliet · 2016-04-05T12:27:50Z

Would anybody see merit in a "data quality summary score"? Meaning a number between 0 and 100 describing the usability of each variable. A normally distributed variable with no missing values would score 100, and the score would drop depending on the problems in the data (outliers, skew, missing data, ...).

I have seen this a couple of times in commercial solutions. Any ideas?
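
A minimal sketch of what such a per-variable score could look like, assuming numeric columns only. The function name `quality_score` and the way the three penalties are combined are illustrative assumptions, not anything pandas-profiling provides or the commercial tools are known to use.

```python
import pandas as pd


def quality_score(series: pd.Series) -> float:
    """Hypothetical 0-100 usability score for one numeric variable.

    Combines three factors, each in [0, 1]: the share of non-missing
    values, the share of values inside the Tukey fences, and a skew
    factor that shrinks as the absolute skewness grows.
    """
    n = len(series)
    if n == 0:
        return 0.0

    values = series.dropna()
    completeness = len(values) / n

    if len(values) < 3:
        # Too few observations to estimate skewness or outliers.
        return 100.0 * completeness

    # Tukey fences: 1.5 * IQR beyond the first and third quartiles.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    inside = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outlier_factor = float(inside.mean())

    # 1.0 for a symmetric variable, smaller for a skewed one.
    skew_factor = 1.0 / (1.0 + abs(float(values.skew())))

    return 100.0 * completeness * outlier_factor * skew_factor
```

With this weighting a symmetric, complete variable scores close to 100, and a variable with half of its values missing cannot score above 50. The weighting itself is arbitrary and would need tuning against real datasets.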

louden · 2016-04-19T19:43:25Z

It is a good idea, but no single score would fit all possible types of analysis. For example, for a non-parametric analysis I may not care about outliers, so I wouldn't want them included in my score function. If you decide to implement it, I would suggest allowing the user to pass the score function as an option:

pandas_profiling.ProfileReport(df, score = my_score_function)

with some number of built-in choices.
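
The design suggested here could be sketched as below. Note that `ProfileReport` has no confirmed `score` argument. The standalone helper `score_columns`, the `BUILTIN_SCORES` registry and the two built-in functions are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Union

import pandas as pd

ScoreFn = Callable[[pd.Series], float]


def completeness(series: pd.Series) -> float:
    """Percentage of non-missing values; ignores outliers entirely."""
    if len(series) == 0:
        return 0.0
    return 100.0 * float(series.notna().mean())


def outlier_aware(series: pd.Series) -> float:
    """Completeness scaled by the share of values inside the Tukey fences."""
    values = series.dropna()
    if values.empty:
        return 0.0
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    inside = values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return completeness(series) * float(inside.mean())


# Hypothetical registry of built-in choices, selectable by name.
BUILTIN_SCORES: Dict[str, ScoreFn] = {
    "completeness": completeness,
    "outlier_aware": outlier_aware,
}


def score_columns(df: pd.DataFrame,
                  score: Union[str, ScoreFn] = "completeness") -> pd.Series:
    """Score every numeric column with a built-in or user-supplied function."""
    fn = BUILTIN_SCORES[score] if isinstance(score, str) else score
    return df.select_dtypes("number").apply(fn)
```

Someone running a non-parametric analysis would keep the default `"completeness"` choice, while `score_columns(df, score=my_score_function)` accepts any callable that maps a column to a number.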

dartdog · 2016-08-25T16:15:26Z

Rather than open another item, I'd like to suggest (or ask about) an idea I'm working on. My skills are not so hot, so I'm hoping someone better than I am could pick it up and run with it:

I love Profiling and it is now my go-to for any new dataset.

What I would really love is the ability to have it automatically compare two or three target variables against the dependent ones (all the others). For instance, with a file containing males and females, we would want to compare the age frequency (as a % of the selected subgroup), and similarly if we broke down the males and females by the state they live in: count, normalize, plot, and so on. It is a bit tricky for categorical variables, where we need to count and normalize them. It might as well also compute the covariance and rank the variables by covariance. This would be hugely valuable for a first look when beginning any machine learning work. Hopefully that makes sense. I'm slowly tinkering with it for a specific file, so maybe I can then learn to generalize it. Maybe someone else is way faster and better than I am and also wants the same!
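
The count-and-normalize step described above can be sketched with plain pandas. This is a rough illustration under assumptions, not part of pandas-profiling: the helper name `compare_by_group`, its parameters and the equal-width binning of numeric features are hypothetical choices.

```python
import pandas as pd


def compare_by_group(df: pd.DataFrame, target: str, feature: str,
                     bins: int = 5) -> pd.DataFrame:
    """Distribution of `feature` within each level of `target`.

    Each column of the result is one target level (e.g. "M", "F"), and
    its entries are shares of that subgroup, so every column sums to 1.
    """
    values = df[feature]
    if pd.api.types.is_numeric_dtype(values):
        # Numeric features (e.g. age) are binned before counting.
        values = pd.cut(values, bins=bins)
    # normalize="columns" divides each count by its subgroup total.
    return pd.crosstab(values, df[target], normalize="columns")
```

Calling `compare_by_group(df, "sex", "age")` or `compare_by_group(df, "sex", "state")` gives one table per feature. With matplotlib installed, `.plot.bar()` on the result would chart the subgroups side by side.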

JosPolfliet · 2016-08-25T21:33:43Z

Doing target profiling is definitely high on the priority list, see #10

Once that is implemented, it would be easier to add a list of target variables instead of just one.

github-actions · 2020-02-17T00:01:27Z

Stale issue

sbrugman added the "feature request 💬" label (Requests for new features) on May 29, 2019

github-actions bot added the no-issue-activity label Feb 17, 2020

github-actions bot closed this as completed Feb 25, 2020

BioComSoftware mentioned this issue Sep 7, 2023

Bug Report: MemoryError: Unable to allocate 20.0 PiB for an array with shape (2817560004071633,) and data type float64 #1435 (open)
