Submission: featureselection (R) #33

techrah opened this issue Mar 17, 2020 · 4 comments


techrah commented Mar 17, 2020

Submitting Author: Ryan Homer (@ryanhomer), Jacky Ho (@jackyho112), Derek Kruszewski (@dkruszew), Victor Cuspinera (@vcuspinera)
Repository: https://github.com/UBC-MDS/feature-selection-r
Version submitted: 1.1.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Anand Shankar Vemparala (@vanandsh)
Reviewer 2: Suvarna Moharir (@suvarna-m)
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: featureselection
Title: Feature Selection
Version: 1.1.0.0
Authors@R: c(person("Victor", "Cuspinera", role = c("aut"), email = "[email protected]"),
             person("Ryan", "Homer", role = c("aut", "cre"), email = "[email protected]"),
             person("Jacky", "Ho", role = c("aut"), email = "[email protected]"),
             person("Derek", "Kruszewski", role = c("aut"), email = "[email protected]"))
Description: Feature Selection with machine learning models.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.0
Suggests:
    testthat (>= 2.1.0),
    tgp (>= 2.4.14),
    covr,
    knitr,
    rmarkdown
Imports:
    dplyr (>= 0.8.0),
    purrr (>= 0.3.0),
    stats
URL: https://github.com/UBC-MDS/feature-selection-r
BugReports: https://github.com/UBC-MDS/feature-selection-r/issues
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text analysis
    • machine learning
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

    This package implements feature selection algorithms. It is expected that the user will use a package such as caret to do the actual model fitting and scoring; this package then uses those results to carry out feature selection.

  • Who is the target audience and what are scientific applications of this package?

    This package is expected to be helpful mainly to data scientists, students of data science, and, more generally, anyone involved in machine learning.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

    Some of these feature selection algorithms already exist on the R platform; others either have no implementation or an implementation that is fairly complex. This package aims to provide easy-to-use implementations of feature selection algorithms, and to offer a more seamless experience between the R and Python platforms through a companion Python edition that works in a very similar way.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

    n/a

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you intend for this package to go on Bioconductor?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

vanandsh commented Mar 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of the package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3 hours

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

A great choice to bundle very relevant and practical features that can be used in almost every Machine Learning pipeline. Implementing them from scratch must have been a great experience and the authors did an awesome job.

I have listed my feedback and suggestions below:

General:

  • The instructions were clearly mentioned in the readme and the installation went through successfully.
  • There were grammatical errors in the README and it would be great if those could be fixed. The spellings are off in some places and that makes the package appear less professional.
  • The functions are well documented and I could get going with the functions very quickly using the examples provided.
  • All dependencies were listed accurately
  • Although the code coverage is pretty good, with very minimal effort it could be pushed to 100%
  • All unit tests passed successfully

Vignettes:

  • The vignettes failed to run locally
  • They were accessible from the link provided on the main page of the repository.

Overall code

  • Since the scorer function is custom-made, I would recommend adding a test to ensure the function returns a numeric
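To make this suggestion concrete, here is a testthat-style sketch of such a test; the scorer shown is a hypothetical placeholder, not the package's actual scorer:

```r
library(testthat)

# Hypothetical scorer, for illustration only: scores a feature subset
custom_scorer <- function(X, y) {
  mean(unlist(y))
}

test_that("custom scorer returns a single numeric value", {
  X <- data.frame(a = 1:5, b = 6:10)
  y <- data.frame(target = c(2, 4, 6, 8, 10))
  result <- custom_scorer(X, y)
  expect_true(is.numeric(result))
  expect_length(result, 1)
})
```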
feature_selection.R
flag_keep_running = TRUE
flag_stop_running = FALSE
  • The standard R assignment operator is '<-', so these could be replaced
#' @param X tibble. training dataset
#' @param y tibble. test dataset
  • The parameter summaries do not seem correct. I would assume X holds the values of the explanatory variables while y is the target.

(sum(!(dim(X))==1) != 2)

  • While checking the dimensions, the above code seems needlessly complex and unreadable. It could be replaced by a simple length check of the dimensions: (length(dim(X)) != 2)
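A minimal sketch comparing the two cases (variable names are illustrative): a data frame has a dim attribute of length 2, while a plain vector's dim is NULL, so the simpler check distinguishes them directly.

```r
X_df  <- data.frame(a = 1:3, b = 4:6)  # a proper 2-d structure
X_vec <- 1:3                           # a plain vector; dim() is NULL

# Suggested simpler check: a 2-d object has a dim attribute of length 2
is_2d <- function(X) length(dim(X)) == 2

is_2d(X_df)   # TRUE
is_2d(X_vec)  # FALSE, since length(NULL) is 0
```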

featureselection::forward_selection(custom_scorer, X, y, 9, 10)
[1] 4 2 5 1 8 9 6 3 3

  • The selected features start repeating once the best feature set has been reached

featureselection::forward_selection(custom_scorer, X, y, 14, 15)
[1] 4 2 5 1 8 9 6 3 3 3 3 3 3 3

  • Values beyond the number of available features are also allowed. These should either be capped or an error should be thrown
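One way such a guard could look, sketched here with hypothetical argument names (not the package's actual implementation):

```r
# Hedged sketch: validate the requested feature-count range up front
check_feature_bounds <- function(X, min_features, max_features) {
  n_cols <- ncol(X)
  if (max_features > n_cols) {
    stop("max_features (", max_features,
         ") exceeds the number of available features (", n_cols, ")")
  }
  if (min_features < 1 || min_features > max_features) {
    stop("min_features must be between 1 and max_features")
  }
  invisible(TRUE)
}

check_feature_bounds(data.frame(a = 1, b = 2, c = 3), 1, 3)    # passes silently
# check_feature_bounds(data.frame(a = 1, b = 2, c = 3), 1, 5)  # would error
```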
simulated_annealing.R
  • I am not sure if it is a good suggestion, but I was thinking a great addition to the function would be an argument that controls the randomness of the features being selected for reproducibility, something like random_state in scikit-learn.
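A sketch of what such an argument could look like; the function body is a placeholder standing in for the annealing loop, not the package's actual code:

```r
annealing_sketch <- function(X, n_features, random_state = NULL) {
  if (!is.null(random_state)) set.seed(random_state)
  # ... the annealing loop would randomly propose feature subsets here ...
  sample(ncol(X), size = n_features)  # placeholder for the selected features
}

X <- data.frame(x1 = 1, x2 = 2, x3 = 3, x4 = 4)
a <- annealing_sketch(X, 2, random_state = 42)
b <- annealing_sketch(X, 2, random_state = 42)
identical(a, b)  # TRUE: same seed, same selection
```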
recursive_feature_elimination.R
df <- tgp::friedman.1.data()
data <- dplyr::select(df, -Ytrue)
X <- dplyr::select(data, X1, X2)
y <- dplyr::select(data, Y)
featureselection::recursive_feature_elimination(custom_scorer_fn, X, y, 2)

Error in featureselection::recursive_feature_elimination(custom_scorer_fn, : Expected a value between 1 and 0 for n_features_to_select.

  • The function fails when we try to choose all features, i.e. n_features_to_select = dim(X)[2]. Although this is not an ideal test and defeats the purpose of the function, I would still expect it to return all the features or an error stating the same.
variance_thresholding.R
  • The for loop iterates over all column names, but the names are not used anywhere in the function, and i has to be declared outside the loop and incremented manually. A better way to do this would be
for (i in 1:length(names(data))) {
}
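A slightly safer variant of the suggestion above uses seq_along, which also behaves correctly when the data frame has no columns; the data below is illustrative only:

```r
data <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3))

variances <- numeric(ncol(data))
for (i in seq_along(names(data))) {
  variances[i] <- var(data[[i]])
}
variances  # 0 for the constant column a, 1 for column b
```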

In general, the package is great for feature selection and delivers what it promises. It was easy to use and it helped me revisit some of the algorithms we learned earlier in the course. :)

The authors have bundled the most widely used feature selection algorithms in a very short time. With the addition of other algorithms that help in feature selection, this package could become extremely useful and have a wide usage.

suvarna-m commented Mar 21, 2020

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of the package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2-3 hours

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Note: I agree with the previous reviewer's comments, my comments are in addition to theirs.

General Comments

  • Overall, very well done! This is a really useful package for many feature selection issues. The functions were really well thought out.

  • Great job on the documentation for your functions. I found your asserts and functions easy to follow, especially with the comments and well-written docstrings.

  • Your README has a really great description of the packages and your example code in the README is useful; both of these helped me gain a deeper understanding of your package even before I looked through your code.

Suggestions for Improvement

  • For your README, I would suggest having separate code chunks for the code and the output. I initially didn't realize that the output was included in the last line of the code chunk, and copied and pasted the whole code cell. It would also be helpful to have a brief (~1 line) description of what the output means.

  • I would recommend adding at least a couple comments into the variance_thresholding function. Although the function itself is simpler than the rest of your functions and may seem self-explanatory, adding a couple simple explanations about what the function/asserts are doing would make this code more consistent with your other functions.

  • Although a minor point, your Code of Conduct should include a valid email in it - currently, it seems to be a default email address that doesn't belong to any of you.

  • There is no vignette.Rmd in your repository, so it does not run locally - it seems like it may have been removed. The vignette seems to be on the package website, but is not in the right location - it should be in an Articles tab. I would recommend going through the Vignette Chapter of the R Whole Game as there seems to be a step you may have missed.

  • I would recommend updating your docstring for variance_thresholding to reflect that the output is a double/double vector.

  • Your code coverage of 95% is good and meets the MDS requirements. However, your package could be improved by adding a few more tests. More specifically, the forward_selection function could use tests for the following asserts:

a)

if (sum(!(dim(X))==1) != 2){
    stop("X must be a 2-d array")
  }

b)

 if (sum(!(dim(X))==1) != 2){
    stop("X must be a 2-d array")
  }

c)

 if (dim(X)[1] != dim(y)[1]){
    stop("X and y have inconsistent numbers of samples. X:", dim(X)[1], ", y:", dim(y)[1])
  }

Final Comments

  • In general, this was a well thought out package. My suggestions are mostly minor, please feel free to consult with your team to make decisions about which suggestions you would like to implement, and please comment below if you have any questions.

@vcuspinera

Thank you for your comments, @vanandsh and @suvarna-m. We appreciate your help and the time you took, not only reviewing the package in general but also going deeper into the code and giving us useful comments! Cheers

dkruszew commented Mar 26, 2020

Thanks @suvarna-m and @vanandsh for your review. Here is the summary of our responses.

Please see release 1.1.8 for these changes.

Summary Responses:

  • There were grammatical errors in the README and it would be great if those could be fixed. The spellings are off in some places and that makes the package appear less professional.

The README.md file has been updated to correct grammatical errors.

  • Although the code coverage is pretty good, with very minimal effort they can be pushed to a 100%

We have increased our coverage from 95% to 98%.

  • The vignettes failed to run locally

We decided to not implement vignettes since the README and the documentation site created with pkgdown were already quite comprehensive for the fairly simple examples.

  • Since the scorer function is custom-made, I would recommend adding a test to ensure the function returns a numeric

Thank you for the suggestion. We decided to not address this issue in this release.

feature_selection.R
flag_keep_running = TRUE
flag_stop_running = FALSE
  • The standard R assignment operator is '<-', so these could be replaced

The code has been fixed.

#' @param X tibble. training dataset
#' @param y tibble. test dataset
  • The parameter summaries do not seem correct. I would assume X holds the values of the explanatory variables while y is the target.

The roxygen comments have been updated to be more consistent with the other functions.

(sum(!(dim(X))==1) != 2)

  • While checking the dimensions, the above code seems needlessly complex and unreadable. It could be replaced by a simple length check of the dimensions: (length(dim(X)) != 2)

The code has been refactored to address this.

featureselection::forward_selection(custom_scorer, X, y, 9, 10)
[1] 4 2 5 1 8 9 6 3 3

  • features selected start repeating once the best features are achieved

Not addressed in this release.

featureselection::forward_selection(custom_scorer, X, y, 14, 15)
[1] 4 2 5 1 8 9 6 3 3 3 3 3 3 3

  • Values beyond the number of available features are also allowed. These should either be capped or an error should be thrown

Checks and tests have been added to address this.

simulated_annealing.R
  • I am not sure if it is a good suggestion, but I was thinking a great addition to the function would be an argument that controls the randomness of the features being selected for reproducibility, something like random_state in scikit-learn.

We have added a random_state parameter to address this.

recursive_feature_elimination.R
df <- tgp::friedman.1.data()
data <- dplyr::select(df, -Ytrue)
X <- dplyr::select(data, X1, X2)
y <- dplyr::select(data, Y)
featureselection::recursive_feature_elimination(custom_scorer_fn, X, y, 2)

Error in featureselection::recursive_feature_elimination(custom_scorer_fn, : Expected a value between 1 and 0 for n_features_to_select.

  • The function fails when we try to choose all features, i.e. n_features_to_select = dim(X)[2]. Although this is not an ideal test and defeats the purpose of the function, I would still expect it to return all the features or an error stating the same.

This will be addressed in a future release.

variance_thresholding.R
  • The for loop iterates over all column names, but the names are not used anywhere in the function, and i has to be declared outside the loop and incremented manually. A better way to do this would be
for (i in 1:length(names(data))) {
}

The code was refactored to address this.

  • For your README, I would suggest having separate code chunks for the code and the output. I initially didn't realize that the output was included in the last line of the code chunk, and copied and pasted the whole code cell. It would also be helpful to have a brief (~1 line) description of what the output means.

Some results are now generated automatically, and all output is prefixed with # so that if it is inadvertently copied and pasted, it will not cause an error.
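For example, a README chunk in this copy-paste-safe style; the values shown are illustrative, not the package's actual output:

```r
# Illustrative only: indices a selection function might return
selected <- c(1, 3, 4)
selected
# [1] 1 3 4
```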

  • I would recommend adding at least a couple comments into the variance_thresholding function. Although the function itself is simpler than the rest of your functions and may seem self-explanatory, adding a couple simple explanations about what the function/asserts are doing would make this code more consistent with your other functions.

Comments have been added to the code for improved clarity.

  • Although a minor point, your Code of Conduct should include a valid email in it - currently, it seems to be a default email address that doesn't belong to any of you.

File has been updated.

  • There is no vignette.Rmd in your repository, so it does not run locally - it seems like it may have been removed. The vignette seems to be on the package website, but is not in the right location - it should be in an Articles tab. I would recommend going through the Vignette Chapter of the R Whole Game as there seems to be a step you may have missed.

See above comment regarding this.

  • I would recommend updating your docstring for variance_thresholding to reflect that the output is a double/double vector.

Roxygen comments have been updated to reflect this.

  • Your code coverage of 95% is good and meets the MDS requirements. However, your package could be improved by adding a few more tests. More specifically, the forward_selection function could use tests for the following asserts:

a)

if (sum(!(dim(X))==1) != 2){
   stop("X must be a 2-d array")
 }

b)

if (sum(!(dim(X))==1) != 2){
   stop("X must be a 2-d array")
 }

c)

if (dim(X)[1] != dim(y)[1]){
   stop("X and y have inconsistent numbers of samples. X:", dim(X)[1], ", y:", dim(y)[1])
 }

These tests have been added, and coverage is now up to 98%.
