-
Notifications
You must be signed in to change notification settings - Fork 0
Submission: pylaundry (Python) #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme requirements
The README should include, from top to bottom:
Functionality
Final approval (post-review)
Estimated hours spent reviewing:
Review CommentsGeneral CommentsOverall I think you have done an excellent job with this packages. I think it addresses a specific need, and your documentation makes it very easy to understand how the package works. All of the feedback I have is pretty minor. Once addressed, I would feel confident approving this package. Specific IssuesIssue 1: Examples in README do not work as shown. For example: >>> from pylaundry import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
NameError: name 'pylaundry' is not defined I think the issue is regarding how you are importing. If I do the following, it works: >>> from pylaundry.categorize import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
{'numeric': ['b'], 'categorical': ['c', 'a']} I think this will hold true with all of your examples. Issue 2: Typos There are several typos in the REAMDE. I quick spell check should fix this. For example Issue 3: Missing examples in doc strings
Issue 4: Other options for checking column types In your README you mention you are not aware of any other packages for categorizing columns. You may want to mention Issue 5: Excessive functions exposed to the user in documentation. Looking at readthedocs there are extra functions here exposed to the user. I would expect to only see four. Is there a way to hide the ones that should not be user facing? Also some functions appear twice, making it a bit more confusing to read. |
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme requirements
The README should include, from top to bottom:
Functionality
Final approval (post-review)
Estimated hours spent reviewing:
Review CommentsGeneral Overview CommentsI was very impressed to see how much work and effort you have put into this package. It looks really professional. The documentation covers a lot of details about how it works. I especially like that you have included and workflow of how all the functions can be used in an integrated form. Specific IssuesIssue 1: Dependency Sklearn==0.22 caused installation problems: I actually had 0.22.1 and when you try to install you get an error message (see below) because you need scikit-learn==0.22.2. Here is the error I got:
And here I solved it:
My recommendation is to add a section under installation guidelines for installation troubleshooting with the error and the solution. Action:
Issue 2: Examples in README with output or input problems:
>>> from pylaundry.categorize import categorize
>>> import pandas
>>>
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>>
>>> # categorize with default max_cat (10)
>>> categorize(df)
README OUTPUT:
{'numeric': ['b'],
'categorical': ['a', 'b']}
ACTUAL OUTPUT:
{'numeric': ['b'],
'categorical': ['c', 'a']}
>>>from pylaundry.fill_missing import fill_missing
>>>import pandas
>>>import numpy as np
>>>df_train = pandas.DataFrame({'a':[1, 2, np.NaN, 4, 4],
>>> 'b':[1.2, 3.4, 3.0, 4.9, np.NaN]})
>>>df_test = pandas.DataFrame({'a':[6, np.NaN, 0],
>>> 'b':[0.5, 9.2, np.NaN]})
>>>fill_missing(df_train, df_test, {'numeric':['b'], 'categorical':['a']},
>>> num_imp = 'median') There is a cat_imp missing argument that throws an error:
Action:
Issue 3: Besides the missing examples in docstrings for some functions as mentioned by SAM (the other reviewer), the example for the categorize function did not work properly: categorize(pd.DataFrame({‘numeric’:[1,2,3,4], ‘cat’:[‘a’,’b’,’c’]})) Here is the error message: "SyntaxError: invalid character in identifier" Action:
Issue 4: Proposed end-to-end workflow:
Action:
Issue 5: n_features in select_features fucntion: The Select features function actually provides features when number of features (n_features) argument is greater than true number of features. This is not mentioned in the docstrings or no error message is output from the function: Here is a snapshot of the issue: Action:
|
Hi @robilizando and @SamEdwardes, Sam's issues:
Robert's Issues:
Thanks again for taking the effort to review |
Submitting Author: Name (@cgostic, @zanderhinton, @amank90, @arunmarria)
Package Name: pylaundry
One-Line Description of Package: Perform standard preprocessing of a dataframe to be used in machine learning algorithms in a streamlined workflow.
Repository Link: https://github.com/UBC-MDS/pylaundry
Version submitted: 1.0.8
Editor: @kvarada
Reviewer 1: @SamEdwardes
Reviewer 2: @robilizando
Archive: TBD
Version accepted: TBD
Description
The
pylaundry
package performs many standard preprocessing techniques for Pandas dataframes, before use in statistical analysis and machine learning. The package functionality includes categorizing column types, handling missing data and imputation, transforming/standardizing columns and feature selection. Thepylaundry
package aims to remove much of the grunt work in the typical data science workflow, allowing the analyst maximum time and energy to devote to modelling!Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
All functions in this package take a Pandas DataFrame as an input. Two of the functions return transformed DataFrames (filled missing values, encoded and scaled), and the other two functions return information about the data gleaned from the DataFrame itself (column types, and most important features).
Pylaundry is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a
.fit
method, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.sklearn.Pipeline offers similar functionality for the fill_missing and transform_columns functions, where similar functions can be wrapped in a Pipeline and carried out sequentially. However, creating a pipeline is a several step process depending on how many types of transformations need to be performed. pylaundry can accomplish numerical and categorical transformations in one step.
There are many feature selection packages and functions, for instance sklearn.feature_selection, which carry out similar functionality to our
feature_selector
function. Pylaundry simplifies feature selection by linear and logistic regression into a single function.As far as we know, there are no similar packages for Categorizing Columns.
pylaundry
is the first package we are aware of to abstract away the full dataframe pre-processing workflow with a unified and simple API.@tag
the editor you contacted:Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.Note: Do not submit your package separately to JOSS
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
The text was updated successfully, but these errors were encountered: