-
Notifications
You must be signed in to change notification settings - Fork 3.9k
[R-package] Allow for valid data to be a slice of the training data #6844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[R-package] Allow for valid data to be a slice of the training data #6844
Conversation
Slices of datasets already contain a reference to the base data (e.g., R/lgb.Dataset.R:562). This change only sets a reference to the base dataset if one does not already exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to contribute!
I've added a link to #6008 to the description... please do that when you contribute in the future and where your contribution is related to previous discussions. It helps to answer the question "what is the benefit of this change?" for reviewers.
Before we review this closely (I haven't yet), please also add a unit test covering this behavior. That'll help build our confidence in this fix, and prevent the bug from being reintroduced in the future.
Add that somewhere in https://github.com/microsoft/LightGBM/blob/master/R-package/tests/testthat/test_basic.R. Try to put it near related tests, and follow all the patterns you see there (for code style, passing .LGB_VERBOSITY
. and .LGB_MAX_THREADS
, etc.).
@microsoft-github-policy-service agree |
Hey @jameslamb! I wanted to clarify whether I should be periodically updating my branch when this PR becomes out of date with the base branch. Also, I see one of the tests failed but it says it was related to Let me know your thoughts! Really appreciate the opportunity to contribute to LightGBM 😄 |
Good question! The answer to that might be different in other projects, but here in LightGBM it's fine to update this branch whenever you want... our CI is fairly cheap to run, and the project rarely has more than 5 concurrent active builds happening at the same time. And besides just being "fine", it is generally helpful... staying up to date improves our confidence that merging the PR is safe.
Yeah I agree, it's not related to your PR. I've documented this at #6855 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for taking the time to contribute! Please see the comment I left... I think this problem has a different root cause and requires a different approach than what's currently proposed in the diff.
R-package/R/lgb.train.R
Outdated
# No need to set reference if one exists | ||
if (is.null(valid_data$.__enclos_env__$private$reference)) { | ||
valid_data$set_reference(data) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right.
This would result in not attempting the update if the validation Dataset
has any reference. That's a problem, because it means that if the training data and validation data have different bin mappers, we might not find that out until much later in the program (and probably with hard-to-understand errors or maybe even more serious problems like segfaults).
(if you're not sure what the terms "reference" or "bin mapper" mean in the context of LightGBM, please tell me and I'll explain)
I think the root cause of this bug is actually right here:
LightGBM/R-package/R/lgb.Dataset.R
Lines 647 to 652 in 6437645
set_reference = function(reference) { | |
# setting reference to this same Dataset object doesn't require any changes | |
if (identical(private$reference, reference)) { | |
return(invisible(self)) | |
} |
When private$reference
is not NULL
, it's an lgb.Dataset
(an {R6}
class instance).
library(lightgbm)
data("iris")
ds <- lgb.Dataset(
as.matrix(iris[, seq_len(3L)])
, label = iris[, 4L]
)
dtrain <- lgb.slice.Dataset(ds, seq_len(100L))
dtrain$.__enclos_env__$private$reference
# <lgb.Dataset>
# Public:
# construct: function ()
# create_valid: function (data, label = NULL, weight = NULL, group = NULL, init_score = NULL,
Using identical()
on 2 {R6}
class instances is not reliable... it can be affected by implementation details of how {R6}
works:
- Implement all.equal.R6 to cope with handling of ABs in recent R versions r-lib/R6#208
- Comparing two objects of the same R6 class for equality r-lib/R6#211
But for that check, it does not matter whether they're the same exact R object (in fact, we can be confident that they aren't)... it's just important that their handles point to the same Dataset
on the C++ side.
I don't really want to take on maintaining or implementing an ==.lgb.Dataset()
generic here (as suggested in r-lib/R6#211)... for this purpose, I think we should try an internal function that only checks the equality of the characteristics we care about.
Maybe something like this in lgb.Dataset.R
(you may have to experiment with this a bit):
.datasets_are_equal <- function(ds1, ds2) {
if (is.null(ds1) && is.null(ds2)) {
return(TRUE)
}
if (is.null(ds1) || is.null(ds2)) {
return(FALSE)
}
return(
identical(
ds1$.__enclos_env__$private$get_handle()
, ds1$.__enclos_env__$private$get_handle()
)
)
}
And then replacing identical()
with a call to that function in the spot I linked to above.
Are you interested in attempting that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate you taking the time to break that down! This makes sense so far. I’d like to make an attempt at those changes so I can understand the codebase better. Thanks!
} | ||
|
||
# The new reference and the existing object may be slices of a parent dataset | ||
if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So adding .datasets_are_equal()
and running these two if {}
checks does "fix" the issue in the sense that we pass all unit tests. However, this line in particular highlights a deeper problem with this special case where data
and valids
are slices of some other dataset. Please correct me if I am misunderstanding here, but I don't believe it is enough to check whether data
and valids
share the same handle in this case (line 661) but we also need to check whether these two datasets share a parent reference (line 666).
However, if we create a slice of a slice, the suggested changes above still break. For example:
ds <- lgb.Dataset(
as.matrix(iris[, seq_len(3L)])
, label = iris[, 4L]
)
train <- lgb.slice.Dataset(ds, seq_len(100L))
test <- lgb.slice.Dataset(ds, seq_len(50L) + 100L)
# Create a slice of a slice
test2 <- lgb.slice.Dataset(test, seq_len(50L))
model <- lgb.train(
params = list(objective = "regression"),
, data = train
, nrounds = 1L
, valids = list(test2 = test2)
)
#> Error in `valid_data$set_reference()`:
#> ! set_reference: cannot set reference after freeing raw data,
#> please set 'free_raw_data = FALSE' when you construct lgb.Dataset
So I think we have a few options:
- Only allow slices one level deep and create a new, more informative error if that is not the case
- Create a recursive or similar system to check for deeper references
- Something else entirely! I am still becoming more familiar with how references work in LightGBM so its possible that there is some cleaner or more obvious solution I am missing there.
What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working with slices-of-slices of Dataset
objects is not something that's well-tested in LightGBM. This is the first time I recall discussing it here or thinking about it.
Does slices-of-slices already fail in the same way on master
? If they do, then the changes in this PR are still a net improvement, and would fix #6008. And if that's true, then I think we should move forward with this PR ignoring slices-of-slices, and I'd appreciate if you could write up a separate issue about how those are handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha! I just created #6880 and will sit tight and wait for additional direction on this PR for now.
R-package/R/lgb.Dataset.R
Outdated
.datasets_are_equal <- function(ds1, ds2) { | ||
handler1 <- ds1$.__enclos_env__$private$get_handle | ||
handler2 <- ds2$.__enclos_env__$private$get_handle | ||
if (is.null(handler1) || is.null(handler2)) { | ||
return(FALSE) | ||
} | ||
return(identical(handler1(), handler2())) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this implementation. I suspect you might not have understood what "handle" means here.
get_handle()
is a function that returns a pointer:
LightGBM/R-package/R/lgb.Dataset.R
Lines 716 to 724 in 82c85e4
get_handle = function() { | |
# Get handle and construct if needed | |
if (.is_null_handle(x = private$handle)) { | |
self$construct() | |
} | |
return(private$handle) | |
}, |
ds1$.__enclos_env__$private$get_handle
will never be true... it will always be a function.
Also notice there that get_handle()
will call self$construct()
if the handle
is NULL... so it should never return NULL
.
I think that here, it would be better to just compare ds1$.__enclos_env__$private$handle
to ds2$.__enclos_env__$private$handle
. Can you please try that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and implemented that with the addition of .is_Dataset()
checks at the top. This implementation avoids some of the confusion from before:
.datasets_are_equal <- function(ds1, ds2) {
if (!.is_Dataset(ds1) || !.is_Dataset(ds2)) {
return(FALSE)
}
return(identical(
ds1$.__enclos_env__$private$handle
, ds2$.__enclos_env__$private$handle
))
}
This is because I check if the reference
shares the same handle as the lgb.Dataset
however, the new reference
may share the same parent handle as the lgb.Dataset
. This gets into some of the issues I touched on in #6880. For now, we are checking one level deep. If the new reference
getting passed in here does not have a parent reference, this will be NULL
and will not have a $handle
on the second if () {}
here:
if (.datasets_are_equal(private$reference, reference)) {
return(invisible(self))
}
# The new reference and the existing object may be slices of a parent dataset
if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) {
return(invisible(self))
}
Additionally, if I don't include the .is_Dataset()
checks there are many test failures saying:
[LightGBM] [Fatal] Cannot add validation data, since it has different bin mappers with training data
/AzurePipelines run |
fixes #6008
Slices of datasets already contain a reference to the base data. This change only sets a reference to the base dataset if one does not already exist.