Submission: featureselection (R) #33

techrah opened this issue Mar 17, 2020 · 4 comments


techrah commented Mar 17, 2020

Submitting Author: Ryan Homer (@ryanhomer), Jacky Ho (@jackyho112), Derek Kruszewski (@dkruszew), Victor Cuspinera (@vcuspinera)
Repository: https://github.com/UBC-MDS/feature-selection-r
Version submitted: 1.1.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Anand Shankar Vemparala (@vanandsh)
Reviewer 2: Suvarna Moharir (@suvarna-m)
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: featureselection
Title: Feature Selection
Version: 1.1.0.0
Authors@R: c(person("Victor", "Cuspinera", role = c("aut"), email = "[email protected]"),
             person("Ryan", "Homer", role = c("aut", "cre"), email = "[email protected]"),
             person("Jacky", "Ho", role = c("aut"), email = "[email protected]"),
             person("Derek", "Kruszewski", role = c("aut"), email = "[email protected]"))
Description: Feature Selection with machine learning models.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.0
Suggests:
    testthat (>= 2.1.0),
    tgp (>= 2.4.14),
    covr,
    knitr,
    rmarkdown
Imports:
    dplyr (>= 0.8.0),
    purrr (>= 0.3.0),
    stats
URL: https://github.com/UBC-MDS/feature-selection-r
BugReports: https://github.com/UBC-MDS/feature-selection-r/issues
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text analysis
    • machine learning
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

    This package implements feature selection algorithms. It is expected that the user will use a package such as caret to do the actual model fitting and scoring; this package then uses those results to carry out feature selection.

  • Who is the target audience and what are scientific applications of this package?

    This package is expected to be helpful mainly to data scientists, students of data science, and, more generally, anyone involved in machine learning.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

    Some of these feature selection algorithms already exist on the R platform; others either have no implementation or an implementation that is fairly complex. This package aims to provide easy-to-use implementations of feature selection algorithms, and to offer a more seamless experience between the R and Python platforms through a companion Python edition that works in a very similar way.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

    n/a

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you intend for this package to go on Bioconductor?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

vanandsh commented Mar 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of the package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3 hours

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

A great choice to bundle very relevant and practical features that can be used in almost every Machine Learning pipeline. Implementing them from scratch must have been a great experience and the authors did an awesome job.

I have listed my feedback and suggestions below:

General:

  • The instructions were clearly mentioned in the readme and the installation went through successfully.
  • There were grammatical errors in the README and it would be great if those could be fixed. The spellings are off in some places and that makes the package appear less professional.
  • The functions are well documented and I could get going with the functions very quickly using the examples provided.
  • All dependencies were listed accurately
  • Although the code coverage is pretty good, with very minimal effort it could be pushed to 100%
  • All unit tests passed successfully

Vignettes:

  • The vignettes failed to run locally
  • They were accessible from the link provided on the main page of the repository.

Overall code

  • Since the scorer function is custom-made, I would recommend adding a test to ensure the function returns a numeric
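To make this suggestion concrete, here is a testthat-style sketch of such a test; the scorer shown is a hypothetical placeholder, not the package's actual scorer:

```r
library(testthat)

# Hypothetical scorer, for illustration only: scores a feature subset
custom_scorer <- function(X, y) {
  mean(unlist(y))
}

test_that("custom scorer returns a single numeric value", {
  X <- data.frame(a = 1:5, b = 6:10)
  y <- data.frame(target = c(2, 4, 6, 8, 10))
  result <- custom_scorer(X, y)
  expect_true(is.numeric(result))
  expect_length(result, 1)
})
```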
feature_selection.R
flag_keep_running = TRUE
flag_stop_running = FALSE
  • The standard R assignment operator is '<-', so these could be replaced
#' @param X tibble. training dataset
#' @param y tibble. test dataset
  • The parameter summaries do not seem correct. I would assume X holds the values of the explanatory variables while y is the target.

(sum(!(dim(X))==1) != 2)

  • While checking the dimensions, the above code seems needlessly complex and unreadable. It could be replaced by a simple length check of the dimensions: (length(dim(X)) != 2)
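A minimal sketch comparing the two cases (variable names are illustrative): a data frame has a dim attribute of length 2, while a plain vector's dim is NULL, so the simpler check distinguishes them directly.

```r
X_df  <- data.frame(a = 1:3, b = 4:6)  # a proper 2-d structure
X_vec <- 1:3                           # a plain vector; dim() is NULL

# Suggested simpler check: a 2-d object has a dim attribute of length 2
is_2d <- function(X) length(dim(X)) == 2

is_2d(X_df)   # TRUE
is_2d(X_vec)  # FALSE, since length(NULL) is 0
```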

featureselection::forward_selection(custom_scorer, X, y, 9, 10)
[1] 4 2 5 1 8 9 6 3 3

  • The selected features start repeating once the best feature set has been reached

featureselection::forward_selection(custom_scorer, X, y, 14, 15)
[1] 4 2 5 1 8 9 6 3 3 3 3 3 3 3

  • Values beyond the number of available features are also allowed. These should either be capped or an error should be thrown
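One way such a guard could look, sketched here with hypothetical argument names (not the package's actual implementation):

```r
# Hedged sketch: validate the requested feature-count range up front
check_feature_bounds <- function(X, min_features, max_features) {
  n_cols <- ncol(X)
  if (max_features > n_cols) {
    stop("max_features (", max_features,
         ") exceeds the number of available features (", n_cols, ")")
  }
  if (min_features < 1 || min_features > max_features) {
    stop("min_features must be between 1 and max_features")
  }
  invisible(TRUE)
}

check_feature_bounds(data.frame(a = 1, b = 2, c = 3), 1, 3)    # passes silently
# check_feature_bounds(data.frame(a = 1, b = 2, c = 3), 1, 5)  # would error
```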
simulated_annealing.R
  • I am not sure if it is a good suggestion, but I was thinking a great addition to the function would be an argument that controls the randomness of the features being selected for reproducibility, something like random_state in scikit-learn.
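A sketch of what such an argument could look like; the function body is a placeholder standing in for the annealing loop, not the package's actual code:

```r
annealing_sketch <- function(X, n_features, random_state = NULL) {
  if (!is.null(random_state)) set.seed(random_state)
  # ... the annealing loop would randomly propose feature subsets here ...
  sample(ncol(X), size = n_features)  # placeholder for the selected features
}

X <- data.frame(x1 = 1, x2 = 2, x3 = 3, x4 = 4)
a <- annealing_sketch(X, 2, random_state = 42)
b <- annealing_sketch(X, 2, random_state = 42)
identical(a, b)  # TRUE: same seed, same selection
```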
recursive_feature_elimination.R
df <- tgp::friedman.1.data()
data <- dplyr::select(df, -Ytrue)
X <- dplyr::select(data, X1, X2)
y <- dplyr::select(data, Y)
featureselection::recursive_feature_elimination(custom_scorer_fn, X, y, 2)

Error in featureselection::recursive_feature_elimination(custom_scorer_fn, : Expected a value between 1 and 0 for n_features_to_select.

  • The function fails when we try to choose all features, i.e. n_features_to_select = dim(X)[2]. Although this is not an ideal test and defeats the purpose of the function, I would still expect it to return all the features or an error stating the same.
variance_thresholding.R
  • The for loop iterates over all column names, but the names are not used anywhere in the function, and i has to be declared outside the loop and incremented manually. A better way to do this would be
for (i in 1:length(names(data))) {
}
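A slightly safer variant of the suggestion above uses seq_along, which also behaves correctly when the data frame has no columns; the data below is illustrative only:

```r
data <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3))

variances <- numeric(ncol(data))
for (i in seq_along(names(data))) {
  variances[i] <- var(data[[i]])
}
variances  # 0 for the constant column a, 1 for column b
```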

In general, the package is great for feature selection and delivers what it promises. It was easy to use and it helped me revisit some of the algorithms we learned earlier in the course. :)

The authors have bundled the most widely used feature selection algorithms in a very short time. With the addition of other algorithms that help in feature selection, this package could become extremely useful and have a wide usage.

suvarna-m commented Mar 21, 2020

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of the package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2-3 hours

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Note: I agree with the previous reviewer's comments, my comments are in addition to theirs.

General Comments

  • Overall, very well done! This is a really useful package for many feature selection issues. The functions were really well thought out.

  • Great job on the documentation for your functions. I found your asserts and functions easy to follow, especially with the comments and well-written docstrings.

  • Your README has a really great description of the packages and your example code in the README is useful; both of these helped me gain a deeper understanding of your package even before I looked through your code.

Suggestions for Improvement

  • For your README, I would suggest having separate code chunks for the code and the output. I initially didn't realize that the output was included in the last line of the code chunk, and copied and pasted the whole code cell. It would also be helpful to have a brief (~1 line) description of what the output means.

  • I would recommend adding at least a couple comments into the variance_thresholding function. Although the function itself is simpler than the rest of your functions and may seem self-explanatory, adding a couple simple explanations about what the function/asserts are doing would make this code more consistent with your other functions.

  • Although a minor point, your Code of Conduct should include a valid email in it - currently, it seems to be a default email address that doesn't belong to any of you.

  • There is no vignette.Rmd in your repository, so it does not run locally - it seems like it may have been removed. The vignette seems to be on the package website, but is not in the right location - it should be in an Articles tab. I would recommend going through the Vignette Chapter of the R Whole Game as there seems to be a step you may have missed.

  • I would recommend updating your docstring for variance_thresholding to reflect that the output is a double/double vector.

  • Your code coverage of 95% is good and meets the MDS requirements. However, your package could be improved by adding a few more tests. More specifically, the forward_selection function could use tests for the following asserts:

a)

if (sum(!(dim(X))==1) != 2){
    stop("X must be a 2-d array")
  }

b)

 if (sum(!(dim(X))==1) != 2){
    stop("X must be a 2-d array")
  }

c)

 if (dim(X)[1] != dim(y)[1]){
    stop("X and y have inconsistent numbers of samples. X:", dim(X)[1], ", y:", dim(y)[1])
  }

Final Comments

  • In general, this was a well thought out package. My suggestions are mostly minor, please feel free to consult with your team to make decisions about which suggestions you would like to implement, and please comment below if you have any questions.

@vcuspinera

Thank you for your comments, @vanandsh and @suvarna-m. We appreciate your help and the time you took, not only reviewing the package in general but also going deeper into the code and giving us useful comments! Cheers

dkruszew commented Mar 26, 2020

Thanks @suvarna-m and @vanandsh for your review. Here is the summary of our responses.

Please see release 1.1.8 for these changes.

Summary Responses:

  • There were grammatical errors in the README and it would be great if those could be fixed. The spellings are off in some places and that makes the package appear less professional.

The README.md file has been updated to correct grammatical errors.

  • Although the code coverage is pretty good, with very minimal effort they can be pushed to a 100%

We have increased our coverage from 95% to 98%.

  • The vignettes failed to run locally

We decided to not implement vignettes since the README and the documentation site created with pkgdown were already quite comprehensive for the fairly simple examples.

  • Since the scorer function is custom-made, I would recommend adding a test to ensure the function returns a numeric

Thank you for the suggestion. We decided to not address this issue in this release.

feature_selection.R
flag_keep_running = TRUE
flag_stop_running = FALSE
  • The standard R assignment operator is '<-', so these could be replaced

The code has been fixed.

#' @param X tibble. training dataset
#' @param y tibble. test dataset
  • The parameter summaries do not seem correct. I would assume X holds the values of the explanatory variables while y is the target.

The roxygen comments have been updated to be more consistent with the other functions.

(sum(!(dim(X))==1) != 2)

  • While checking the dimensions, the above code seems needlessly complex and unreadable. It could be replaced by a simple length check of the dimensions: (length(dim(X)) != 2)

The code has been refactored to address this.

featureselection::forward_selection(custom_scorer, X, y, 9, 10)
[1] 4 2 5 1 8 9 6 3 3

  • features selected start repeating once the best features are achieved

Not addressed in this release.

featureselection::forward_selection(custom_scorer, X, y, 14, 15)
[1] 4 2 5 1 8 9 6 3 3 3 3 3 3 3

  • Values beyond the number of available features are also allowed. These should either be capped or an error should be thrown

Checks and tests have been added to address this.

simulated_annealing.R
  • I am not sure if it is a good suggestion, but I was thinking a great addition to the function would be an argument that controls the randomness of the features being selected for reproducibility, something like random_state in scikit-learn.

We have added a random_state parameter to address this.

recursive_feature_elimination.R
df <- tgp::friedman.1.data()
data <- dplyr::select(df, -Ytrue)
X <- dplyr::select(data, X1, X2)
y <- dplyr::select(data, Y)
featureselection::recursive_feature_elimination(custom_scorer_fn, X, y, 2)

Error in featureselection::recursive_feature_elimination(custom_scorer_fn, : Expected a value between 1 and 0 for n_features_to_select.

  • The function fails when we try to choose all features, i.e. n_features_to_select = dim(X)[2]. Although this is not an ideal test and defeats the purpose of the function, I would still expect it to return all the features or an error stating the same.

This will be addressed in a future release.

variance_thresholding.R
  • The for loop iterates over all column names, but the names are not used anywhere in the function, and i has to be declared outside the loop and incremented manually. A better way to do this would be
for (i in 1:length(names(data))) {
}

The code was refactored to address this.

  • For your README, I would suggest having separate code chunks for the code and the output. I initially didn't realize that the output was included in the last line of the code chunk, and copied and pasted the whole code cell. It would also be helpful to have a brief (~1 line) description of what the output means.

Some results are now generated automatically, and all output is prefixed with # so that if it is inadvertently copied and pasted, it will not cause an error.
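For example, a README chunk in this copy-paste-safe style; the values shown are illustrative, not the package's actual output:

```r
# Illustrative only: indices a selection function might return
selected <- c(1, 3, 4)
selected
# [1] 1 3 4
```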

  • I would recommend adding at least a couple comments into the variance_thresholding function. Although the function itself is simpler than the rest of your functions and may seem self-explanatory, adding a couple simple explanations about what the function/asserts are doing would make this code more consistent with your other functions.

Comments have been added to the code for improved clarity.

  • Although a minor point, your Code of Conduct should include a valid email in it - currently, it seems to be a default email address that doesn't belong to any of you.

File has been updated.

  • There is no vignette.Rmd in your repository, so it does not run locally - it seems like it may have been removed. The vignette seems to be on the package website, but is not in the right location - it should be in an Articles tab. I would recommend going through the Vignette Chapter of the R Whole Game as there seems to be a step you may have missed.

See above comment regarding this.

  • I would recommend updating your docstring for variance_thresholding to reflect that the output is a double/double vector.

Roxygen comments have been updated to reflect this.

  • Your code coverage of 95% is good and meets the MDS requirements. However, your package could be improved by adding a few more tests. More specifically, the forward_selection function could use tests for the following asserts:

a)

if (sum(!(dim(X))==1) != 2){
   stop("X must be a 2-d array")
 }

b)

if (sum(!(dim(X))==1) != 2){
   stop("X must be a 2-d array")
 }

c)

if (dim(X)[1] != dim(y)[1]){
   stop("X and y have inconsistent numbers of samples. X:", dim(X)[1], ", y:", dim(y)[1])
 }

These tests have been added, and coverage is now up to 98%.
