[R-package] Allow for valid data to be a slice of the training data #6844

walkerjameschris · 2025-02-28T02:23:51Z

Slices of datasets already contain a reference to the base data. This change only sets a reference to the base dataset if one does not already exist.

Slices of datasets already contain a reference to the base data (e.g., R/lgb.Dataset.R:562). This change only sets a reference to the base dataset if one does not already exist.

jameslamb

Thanks for taking the time to contribute!

I've added a link to #6008 to the description... please do that when you contribute in the future and where your contribution is related to previous discussions. It helps to answer the question "what is the benefit of this change?" for reviewers.

Before we review this closely (I haven't yet), please also add a unit test covering this behavior. That'll help build our confidence in this fix, and prevent the bug from being reintroduced in the future.

Add that somewhere in https://github.com/microsoft/LightGBM/blob/master/R-package/tests/testthat/test_basic.R. Try to put it near related tests, and follow all the patterns you see there (for code style, passing .LGB_VERBOSITY. and .LGB_MAX_THREADS, etc.).

walkerjameschris · 2025-02-28T04:13:10Z

@microsoft-github-policy-service agree

walkerjameschris · 2025-03-04T21:58:27Z

Hey @jameslamb!

I wanted to clarify whether I should be periodically updating my branch when this PR becomes out of date with the base branch. Also, I see one of the tests failed but it says it was related to knitr which I don’t think is impacted by my PR, especially given all other checks passed.

Let me know your thoughts! Really appreciate the opportunity to contribute to LightGBM 😄

jameslamb · 2025-03-07T04:10:35Z

I wanted to clarify whether I should be periodically updating my branch when this PR becomes out of date with the base branch.

Good question!

The answer to that might be different in other projects, but here in LightGBM it's fine to update this branch whenever you want... our CI is fairly cheap to run, and the project rarely has more than 5 concurrent active builds happening at the same time.

And besides just being "fine", it is generally helpful... staying up to date improves our confidence that merging the PR is safe.

Also, I see one of the tests failed but it says it was related to knitr which I don’t think is impacted by my PR, especially given all other checks passed.

Yeah I agree, it's not related to your PR. I've documented this at #6855

jameslamb

Thanks very much for taking the time to contribute! Please see the comment I left... I think this problem has a different root cause and requires a different approach than what's currently proposed in the diff.

jameslamb · 2025-03-07T05:40:42Z

R-package/R/lgb.train.R

+      # No need to set reference if one exists
+      if (is.null(valid_data$.__enclos_env__$private$reference)) {
+        valid_data$set_reference(data)
+      }


I don't think this is quite right.

This would result in not attempting the update if the validation Dataset has any reference. That's a problem, because it means that if the training data and validation data have different bin mappers, we might not find that out until much later in the program (and probably with hard-to-understand errors or maybe even more serious problems like segfaults).

(if you're not sure what the terms "reference" or "bin mapper" mean in the context of LightGBM, please tell me and I'll explain)

I think the root cause of this bug is actually right here:

LightGBM/R-package/R/lgb.Dataset.R

Lines 647 to 652 in 6437645

set_reference = function(reference) {

# setting reference to this same Dataset object doesn't require any changes

if (identical(private$reference, reference)) {

return(invisible(self))

}

When private$reference is not NULL, it's an lgb.Dataset (an {R6} class instance).

library(lightgbm) data("iris") ds <- lgb.Dataset( as.matrix(iris[, seq_len(3L)]) , label = iris[, 4L] ) dtrain <- lgb.slice.Dataset(ds, seq_len(100L)) dtrain$.__enclos_env__$private$reference # <lgb.Dataset> # Public: # construct: function () # create_valid: function (data, label = NULL, weight = NULL, group = NULL, init_score = NULL,

Using identical() on 2 {R6} class instances is not reliable... it can be affected by implementation details of how {R6} works:

Implement all.equal.R6 to cope with handling of ABs in recent R versions r-lib/R6#208

Comparing two objects of the same R6 class for equality r-lib/R6#211

But for that check, it does not matter whether they're the same exact R object (in fact, we can be confident that they aren't)... it's just important that their handles point to the same Dataset on the C++ side.

I don't really want to take on maintaining or implementing an ==.lgb.Dataset() generic here (as suggested in r-lib/R6#211)... for this purpose, I think we should try an internal function that only checks the equality of the characteristics we care about.

Maybe something like this in lgb.Dataset.R (you may have to experiment with this a bit):

.datasets_are_equal <- function(ds1, ds2) { if (is.null(ds1) && is.null(ds2)) { return(TRUE) } if (is.null(ds1) || is.null(ds2)) { return(FALSE) } return( identical( ds1$.__enclos_env__$private$get_handle() , ds1$.__enclos_env__$private$get_handle() ) ) }

And then replacing identical() with a call to that function in the spot I linked to above.

Are you interested in attempting that?

I appreciate you taking the time to break that down! This makes sense so far. I’d like to make an attempt at those changes so I can understand the codebase better. Thanks!

walkerjameschris · 2025-03-12T12:45:57Z

R-package/R/lgb.Dataset.R

+      }
+
+      # The new reference and the existing object may be slices of a parent dataset
+      if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) {


So adding .datasets_are_equal() and running these two if {} checks does "fix" the issue in the sense that we pass all unit tests. However, this line in particular highlights a deeper problem with this special case where data and valids are slices of some other dataset. Please correct me if I am misunderstanding here, but I don't believe it is enough to check whether data and valids share the same handle in this case (line 661) but we also need to check whether these two datasets share a parent reference (line 666).

However, if we create a slice of a slice, the suggested changes above still break. For example:

ds <- lgb.Dataset( as.matrix(iris[, seq_len(3L)]) , label = iris[, 4L] ) train <- lgb.slice.Dataset(ds, seq_len(100L)) test <- lgb.slice.Dataset(ds, seq_len(50L) + 100L) # Create a slice of a slice test2 <- lgb.slice.Dataset(test, seq_len(50L)) model <- lgb.train( params = list(objective = "regression"), , data = train , nrounds = 1L , valids = list(test2 = test2) ) #> Error in `valid_data$set_reference()`: #> ! set_reference: cannot set reference after freeing raw data, #> please set 'free_raw_data = FALSE' when you construct lgb.Dataset

So I think we have a few options:

Only allow slices one level deep and create a new, more informative error if that is not the case

Create a recursive or similar system to check for deeper references

Something else entirely! I am still becoming more familiar with how references work in LightGBM so its possible that there is some cleaner or more obvious solution I am missing there.

What are your thoughts?

Working with slices-of-slices of Dataset objects is not something that's well-tested in LightGBM. This is the first time I recall discussing it here or thinking about it.

Does slices-of-slices already fail in the same way on master? If they do, then the changes in this PR are still a net improvement, and would fix #6008. And if that's true, then I think we should move forward with this PR ignoring slices-of-slices, and I'd appreciate if you could write up a separate issue about how those are handled.

Gotcha! I just created #6880 and will sit tight and wait for additional direction on this PR for now.

jameslamb · 2025-04-05T01:15:20Z

R-package/R/lgb.Dataset.R

+.datasets_are_equal <- function(ds1, ds2) {
+  handler1 <- ds1$.__enclos_env__$private$get_handle
+  handler2 <- ds2$.__enclos_env__$private$get_handle
+  if (is.null(handler1) || is.null(handler2)) {
+    return(FALSE)
+  }
+  return(identical(handler1(), handler2()))
+}


I'm confused by this implementation. I suspect you might not have understood what "handle" means here.

get_handle() is a function that returns a pointer:

LightGBM/R-package/R/lgb.Dataset.R

Lines 716 to 724 in 82c85e4

get_handle = function() {

# Get handle and construct if needed

if (.is_null_handle(x = private$handle)) {

self$construct()

}

return(private$handle)

},

ds1$.__enclos_env__$private$get_handle will never be true... it will always be a function.

Also notice there that get_handle() will call self$construct() if the handle is NULL... so it should never return NULL.

I think that here, it would be better to just compare ds1$.__enclos_env__$private$handle to ds2$.__enclos_env__$private$handle. Can you please try that?

I went ahead and implemented that with the addition of .is_Dataset() checks at the top. This implementation avoids some of the confusion from before:

.datasets_are_equal <- function(ds1, ds2) { if (!.is_Dataset(ds1) || !.is_Dataset(ds2)) { return(FALSE) } return(identical( ds1$.__enclos_env__$private$handle , ds2$.__enclos_env__$private$handle )) }

This is because I check if the reference shares the same handle as the lgb.Dataset however, the new reference may share the same parent handle as the lgb.Dataset. This gets into some of the issues I touched on in #6880. For now, we are checking one level deep. If the new reference getting passed in here does not have a parent reference, this will be NULL and will not have a $handle on the second if () {} here:

if (.datasets_are_equal(private$reference, reference)) { return(invisible(self)) } # The new reference and the existing object may be slices of a parent dataset if (.datasets_are_equal(private$reference, reference$.__enclos_env__$private$reference)) { return(invisible(self)) }

Additionally, if I don't include the .is_Dataset() checks there are many test failures saying:

[LightGBM] [Fatal] Cannot add validation data, since it has different bin mappers with training data

StrikerRUS · 2025-04-24T09:47:15Z

/AzurePipelines run

Only set reference if one does not exist

1c3699f

Slices of datasets already contain a reference to the base data (e.g., R/lgb.Dataset.R:562). This change only sets a reference to the base dataset if one does not already exist.

walkerjameschris requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners February 28, 2025 02:23

walkerjameschris mentioned this pull request Feb 28, 2025

[R-package] cannot uses slices of the same Dataset as training and validation datasets #6008

Open

jameslamb changed the title ~~Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist.~~ [R-package] Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist. Feb 28, 2025

jameslamb requested changes Feb 28, 2025

View reviewed changes

jameslamb added in progress fix labels Feb 28, 2025

Adding unit test for training on slices of ldb.Dataset

b6ac834

walkerjameschris and others added 4 commits February 28, 2025 22:03

Correcting whitespace

25d681b

Adding types for constants

6b1add4

Adding integer qualifier

1153a10

Merge branch 'master' into fix/slice-reference

97a6691

Merge branch 'master' into fix/slice-reference

d39bc5d

jameslamb changed the title ~~[R-package] Allow for valid data to be a slice of the main lgb.Dataset by setting references if one does not exist.~~ [R-package] Allow for valid data to be a slice of the main lgb.Dataset Mar 7, 2025

jameslamb requested changes Mar 7, 2025

View reviewed changes

jameslamb changed the title ~~[R-package] Allow for valid data to be a slice of the main lgb.Dataset~~ [R-package] Allow for valid data to be a slice of the training data Mar 7, 2025

Reverting lgb.train(), using helper in lgb.Dataset

151f0ce

walkerjameschris commented Mar 12, 2025

View reviewed changes

Merge branch 'master' into fix/slice-reference

3caddb0

walkerjameschris mentioned this pull request Apr 4, 2025

[R-package] Slices-of-slices of lgb.Dataset result in error when used as valids #6880

Open

jameslamb requested changes Apr 5, 2025

View reviewed changes

Making dataset equality check simpler

5ec6dd1

walkerjameschris and others added 2 commits April 20, 2025 18:11

Merge branch 'master' into fix/slice-reference

3b90142

Merge branch 'master' into fix/slice-reference

5245aac

walkerjameschris and others added 3 commits April 26, 2025 23:19

Merge branch 'master' into fix/slice-reference

0a977bb

Merge branch 'master' into fix/slice-reference

bde53c1

Merge branch 'master' into fix/slice-reference

0310752

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[R-package] Allow for valid data to be a slice of the training data #6844

[R-package] Allow for valid data to be a slice of the training data #6844

Uh oh!

walkerjameschris commented Feb 28, 2025 •

edited by jameslamb

Loading

Uh oh!

jameslamb left a comment

Uh oh!

walkerjameschris commented Feb 28, 2025

Uh oh!

walkerjameschris commented Mar 4, 2025

Uh oh!

jameslamb commented Mar 7, 2025

Uh oh!

jameslamb left a comment

Uh oh!

jameslamb Mar 7, 2025

Uh oh!

walkerjameschris Mar 7, 2025

Uh oh!

walkerjameschris Mar 12, 2025

Uh oh!

jameslamb Apr 4, 2025

Uh oh!

walkerjameschris Apr 4, 2025

Uh oh!

jameslamb Apr 5, 2025

Uh oh!

walkerjameschris Apr 10, 2025 •

edited

Loading

Uh oh!

StrikerRUS commented Apr 24, 2025

Uh oh!

Uh oh!

	set_reference = function(reference) {

	# setting reference to this same Dataset object doesn't require any changes
	if (identical(private$reference, reference)) {
	return(invisible(self))
	}

	get_handle = function() {

	# Get handle and construct if needed
	if (.is_null_handle(x = private$handle)) {
	self$construct()
	}
	return(private$handle)

	},

[R-package] Allow for valid data to be a slice of the training data #6844

Are you sure you want to change the base?

[R-package] Allow for valid data to be a slice of the training data #6844

Uh oh!

Conversation

walkerjameschris commented Feb 28, 2025 • edited by jameslamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jameslamb left a comment

Choose a reason for hiding this comment

Uh oh!

walkerjameschris commented Feb 28, 2025

Uh oh!

walkerjameschris commented Mar 4, 2025

Uh oh!

jameslamb commented Mar 7, 2025

Uh oh!

jameslamb left a comment

Choose a reason for hiding this comment

Uh oh!

jameslamb Mar 7, 2025

Choose a reason for hiding this comment

Uh oh!

walkerjameschris Mar 7, 2025

Choose a reason for hiding this comment

Uh oh!

walkerjameschris Mar 12, 2025

Choose a reason for hiding this comment

Uh oh!

jameslamb Apr 4, 2025

Choose a reason for hiding this comment

Uh oh!

walkerjameschris Apr 4, 2025

Choose a reason for hiding this comment

Uh oh!

jameslamb Apr 5, 2025

Choose a reason for hiding this comment

Uh oh!

walkerjameschris Apr 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StrikerRUS commented Apr 24, 2025

Uh oh!

Uh oh!

walkerjameschris commented Feb 28, 2025 •

edited by jameslamb

Loading

walkerjameschris Apr 10, 2025 •

edited

Loading