Skip to content

[R-package] Allow for valid data to be a slice of the training data #6844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

walkerjameschris
Copy link
Contributor

@walkerjameschris walkerjameschris commented Feb 28, 2025

fixes #6008

Slices of datasets already contain a reference to the base data. This change only sets a reference to the base dataset if one does not already exist.

Slices of datasets already contain a reference to the base data (e.g., R/lgb.Dataset.R:562). This change only sets a reference to the base dataset if one does not already exist.
@jameslamb jameslamb changed the title Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist. [R-package] Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist. Feb 28, 2025
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to contribute!

I've added a link to #6008 to the description... please do that when you contribute in the future and where your contribution is related to previous discussions. It helps to answer the question "what is the benefit of this change?" for reviewers.

Before we review this closely (I haven't yet), please also add a unit test covering this behavior. That'll help build our confidence in this fix, and prevent the bug from being reintroduced in the future.

Add that somewhere in https://github.com/microsoft/LightGBM/blob/master/R-package/tests/testthat/test_basic.R. Try to put it near related tests, and follow all the patterns you see there (for code style, passing .LGB_VERBOSITY. and .LGB_MAX_THREADS, etc.).

@walkerjameschris
Copy link
Contributor Author

@microsoft-github-policy-service agree

@walkerjameschris
Copy link
Contributor Author

Hey @jameslamb!

I wanted to clarify whether I should be periodically updating my branch when this PR becomes out of date with the base branch. Also, I see one of the tests failed but it says it was related to knitr which I don’t think is impacted by my PR, especially given all other checks passed.

Let me know your thoughts! Really appreciate the opportunity to contribute to LightGBM 😄

@jameslamb
Copy link
Collaborator

I wanted to clarify whether I should be periodically updating my branch when this PR becomes out of date with the base branch.

Good question!

The answer to that might be different in other projects, but here in LightGBM it's fine to update this branch whenever you want... our CI is fairly cheap to run, and the project rarely has more than 5 concurrent active builds happening at the same time.

And besides just being "fine", it is generally helpful... staying up to date improves our confidence that merging the PR is safe.

Also, I see one of the tests failed but it says it was related to knitr which I don’t think is impacted by my PR, especially given all other checks passed.

Yeah I agree, it's not related to your PR. I've documented this at #6855

@jameslamb jameslamb changed the title [R-package] Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist. [R-package] Allow for valid data to be a slice of the main lgb.Dataset Mar 7, 2025
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for taking the time to contribute! Please see the comment I left... I think this problem has a different root cause and requires a different approach than what's currently proposed in the diff.

# No need to set reference if one exists
if (is.null(valid_data$.__enclos_env__$private$reference)) {
valid_data$set_reference(data)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is quite right.

This would result in not attempting the update if the validation Dataset has any reference. That's a problem, because it means that if the training data and validation data have different bin mappers, we might not find that out until much later in the program (and probably with hard-to-understand errors or maybe even more serious problems like segfaults).

(if you're not sure what the terms "reference" or "bin mapper" mean in the context of LightGBM, please tell me and I'll explain)

I think the root cause of this bug is actually right here:

set_reference = function(reference) {
# setting reference to this same Dataset object doesn't require any changes
if (identical(private$reference, reference)) {
return(invisible(self))
}

When private$reference is not NULL, it's an lgb.Dataset (an {R6} class instance).

library(lightgbm)

data("iris")

ds <- lgb.Dataset(
    as.matrix(iris[, seq_len(3L)])
    , label = iris[, 4L]
)

dtrain <- lgb.slice.Dataset(ds, seq_len(100L))
dtrain$.__enclos_env__$private$reference
# <lgb.Dataset>
#   Public:
#     construct: function () 
#     create_valid: function (data, label = NULL, weight = NULL, group = NULL, init_score = NULL, 

Using identical() on 2 {R6} class instances is not reliable... it can be affected by implementation details of how {R6} works:

But for that check, it does not matter whether they're the same exact R object (in fact, we can be confident that they aren't)... it's just important that their handles point to the same Dataset on the C++ side.

I don't really want to take on maintaining or implementing an ==.lgb.Dataset() generic here (as suggested in r-lib/R6#211)... for this purpose, I think we should try an internal function that only checks the equality of the characteristics we care about.

Maybe something like this in lgb.Dataset.R (you may have to experiment with this a bit):

.datasets_are_equal <- function(ds1, ds2) {
    if (is.null(ds1) && is.null(ds2)) {
        return(TRUE)
    }
    if (is.null(ds1) || is.null(ds2)) {
        return(FALSE)
    }
    return(
        identical(
            ds1$.__enclos_env__$private$get_handle()
            , ds1$.__enclos_env__$private$get_handle()
        )
    )
}

And then replacing identical() with a call to that function in the spot I linked to above.

Are you interested in attempting that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate you taking the time to break that down! This makes sense so far. I’d like to make an attempt at those changes so I can understand the codebase better. Thanks!

@jameslamb jameslamb changed the title [R-package] Allow for valid data to be a slice of the main lgb.Dataset [R-package] Allow for valid data to be a slice of the training data Mar 7, 2025
}

# The new reference and the existing object may be slices of a parent dataset
if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So adding .datasets_are_equal() and running these two if {} checks does "fix" the issue in the sense that we pass all unit tests. However, this line in particular highlights a deeper problem with this special case where data and valids are slices of some other dataset. Please correct me if I am misunderstanding here, but I don't believe it is enough to check whether data and valids share the same handle in this case (line 661) but we also need to check whether these two datasets share a parent reference (line 666).

However, if we create a slice of a slice, the suggested changes above still break. For example:

ds <- lgb.Dataset(
  as.matrix(iris[, seq_len(3L)])
  , label = iris[, 4L]
)

train <- lgb.slice.Dataset(ds, seq_len(100L))
test <- lgb.slice.Dataset(ds, seq_len(50L) + 100L)

# Create a slice of a slice
test2 <- lgb.slice.Dataset(test, seq_len(50L))

model <- lgb.train(
  params = list(objective = "regression"),
  , data = train
  , nrounds = 1L
  , valids = list(test2 = test2)
)

#> Error in `valid_data$set_reference()`:
#> ! set_reference: cannot set reference after freeing raw data,
#>  please set 'free_raw_data = FALSE' when you construct lgb.Dataset

So I think we have a few options:

  • Only allow slices one level deep and create a new, more informative error if that is not the case
  • Create a recursive or similar system to check for deeper references
  • Something else entirely! I am still becoming more familiar with how references work in LightGBM so its possible that there is some cleaner or more obvious solution I am missing there.

What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working with slices-of-slices of Dataset objects is not something that's well-tested in LightGBM. This is the first time I recall discussing it here or thinking about it.

Does slices-of-slices already fail in the same way on master? If they do, then the changes in this PR are still a net improvement, and would fix #6008. And if that's true, then I think we should move forward with this PR ignoring slices-of-slices, and I'd appreciate if you could write up a separate issue about how those are handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! I just created #6880 and will sit tight and wait for additional direction on this PR for now.

Comment on lines 26 to 33
.datasets_are_equal <- function(ds1, ds2) {
handler1 <- ds1$.__enclos_env__$private$get_handle
handler2 <- ds2$.__enclos_env__$private$get_handle
if (is.null(handler1) || is.null(handler2)) {
return(FALSE)
}
return(identical(handler1(), handler2()))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this implementation. I suspect you might not have understood what "handle" means here.

get_handle() is a function that returns a pointer:

get_handle = function() {
# Get handle and construct if needed
if (.is_null_handle(x = private$handle)) {
self$construct()
}
return(private$handle)
},

ds1$.__enclos_env__$private$get_handle will never be true... it will always be a function.

Also notice there that get_handle() will call self$construct() if the handle is NULL... so it should never return NULL.

I think that here, it would be better to just compare ds1$.__enclos_env__$private$handle to ds2$.__enclos_env__$private$handle. Can you please try that?

Copy link
Contributor Author

@walkerjameschris walkerjameschris Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and implemented that with the addition of .is_Dataset() checks at the top. This implementation avoids some of the confusion from before:

.datasets_are_equal <- function(ds1, ds2) {
  if (!.is_Dataset(ds1) || !.is_Dataset(ds2)) {
    return(FALSE)
  }
  return(identical(
    ds1$.__enclos_env__$private$handle
    , ds2$.__enclos_env__$private$handle
  ))
}

This is because I check if the reference shares the same handle as the lgb.Dataset however, the new reference may share the same parent handle as the lgb.Dataset. This gets into some of the issues I touched on in #6880. For now, we are checking one level deep. If the new reference getting passed in here does not have a parent reference, this will be NULL and will not have a $handle on the second if () {} here:

if (.datasets_are_equal(private$reference, reference)) {
  return(invisible(self))
}

# The new reference and the existing object may be slices of a parent dataset
if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) {
  return(invisible(self))
}

Additionally, if I don't include the .is_Dataset() checks there are many test failures saying:

[LightGBM] [Fatal] Cannot add validation data, since it has different bin mappers with training data

@StrikerRUS
Copy link
Collaborator

/AzurePipelines run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-package] cannot uses slices of the same Dataset as training and validation datasets
3 participants