remove arrow dependency from data skipping #132
Conversation
static ref STATS_EXPR: Expr = Expr::column("add.stats");
static ref FILTER_EXPR: Expr = Expr::column("predicate").distinct(Expr::literal(false));
Note: I did just a little bit of reorganization, to list things in the order they are used.
Nice! This is awesome.
Can we also ensure we make arrow fully optional in `Cargo.toml`?
.extract(Arc::new(schema), &mut visitor)?;
Ok(visitor.selection_vector)

// TODO(zach): add some debug info about data skipping that occurred
This will need to happen at the higher level, or the selection vector could count the number of `true`s it has.
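A selection vector is just a per-row boolean mask, so counting the surviving rows is trivial; a minimal sketch (hypothetical helper name, not kernel API):

```rust
// Count how many rows a selection vector keeps (true = row selected).
fn count_selected(selection: &[bool]) -> usize {
    selection.iter().filter(|&&keep| keep).count()
}
```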
meh kinda just left this open for now lol
@@ -70,15 +86,13 @@ impl LogReplayScanner {
    actions: &dyn EngineData,
    is_log_batch: bool,
) -> DeltaResult<Vec<Add>> {
If we applied the JVM kernel approach here, this method would take the engine data as input and return a selection vector that covers both data skipping and log replay for the rows of this batch. The engine data would not be modified at all, and the engine is free to apply the filtering however it wishes.
This also avoids the need to (pay the cost to) parse `EngineData` rows into `Add` and `Remove` structs, which in turn avoids the need to expose those structs as part of the public API.
> return a selection vector that covers both data skipping and log replay for rows of this batch

Oh, like the selection vector would just 'pick' which actions (which would all be `Add` actions) represent the valid files after data skipping/applying removes from the seen set?

> This also avoids the need to (pay the cost to) parse EngineData rows into Add and Remove structs, which in turn avoids the need to expose those structs as part of the public API.

We still would need to inspect the path/dv in order to add to the 'seen' set and perform the log replay (remove action) filtering, correct? You're just suggesting to do this on-the-fly instead of parsing into structs?
Yes, the visitor isn't required to parse the exploded fields into a struct! It could ask for just the fields we need to examine, update the selection vector accordingly, and let the engine data continue providing the actual rows.
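A rough sketch of that idea, with all names hypothetical: the visitor inspects only the path and DV id per row, flips selection-vector entries in place, and never materializes `Add`/`Remove` structs:

```rust
use std::collections::HashSet;

// Hypothetical visitor: updates the selection vector in place instead of
// parsing rows into Add/Remove structs.
struct SelectionVisitor {
    selection: Vec<bool>,
    seen: HashSet<(String, Option<String>)>,
}

impl SelectionVisitor {
    // Called once per row, with only the fields log replay needs.
    fn visit_row(&mut self, i: usize, path: String, dv_id: Option<String>, is_remove: bool) {
        let key = (path, dv_id);
        if is_remove || self.seen.contains(&key) {
            // Removes, and (path, dv) pairs already seen, are deselected.
            self.selection[i] = false;
        }
        self.seen.insert(key);
    }
}
```

The engine then applies the final selection vector to the unmodified batch however it likes.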
@@ -90,7 +104,7 @@ impl LogReplayScanner {
     // only serve as tombstones for vacuum jobs. So no need to load them here.
     vec![crate::actions::schemas::ADD_FIELD.clone()]
 });
-let mut visitor = AddRemoveVisitor::default();
+let mut visitor = AddRemoveVisitor::new(selection_vector);
 actions.extract(Arc::new(schema_to_use), &mut visitor)?;
for remove in visitor.removes.into_iter() {
NOTE: We don't need to process removes before adds within any given batch, because each batch is a subset of some commit, and a given commit cannot legally contain more than one action for a given path. We just need to remember them in `self.seen` in case the next batch needs them.
ah yep +1 I can fix in a separate (small) PR?
What are we fixing though? It's good to know that we don't need to look at the removes first, but it's also not wrong.
The visitor has separated them for us already, so it's not any extra cost to do them first.
Small nit for something I noticed below on line 98 (sorry, github won't let me comment there). You can remove a `clone`:

for remove in visitor.removes.into_iter() {
    let dv_id = remove.dv_unique_id();
    self.seen.insert((remove.path, dv_id));
}
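For context, the reason `into_iter()` lets the clone go away is that it yields owned `Remove` values, so `remove.path` can be moved into the set rather than cloned. A self-contained sketch with a simplified, hypothetical `Remove`:

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-in for the kernel's Remove action.
struct Remove {
    path: String,
    dv_id: Option<String>,
}

fn collect_seen(removes: Vec<Remove>) -> HashSet<(String, Option<String>)> {
    let mut seen = HashSet::new();
    // into_iter() yields owned Remove values, so `remove.path` is moved,
    // not cloned, into the seen set.
    for remove in removes.into_iter() {
        seen.insert((remove.path, remove.dv_id));
    }
    seen
}
```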
Removed the clone :) And I can open an issue if we want to modify the way we process removes?
> What are we fixing though? It's good to know that we don't need to look at the removes first, but it's also not wrong.

It requires materializing (arbitrarily large) lists of adds and removes, which kernel shouldn't be in the business of doing in the first place. We should really be returning the original engine data along with an updated selection vector, since the engine already went to the trouble of allocating all the actions for us there.
Got it. This is a more general "fix the way scan works" kind of change. If I understand correctly, your model would switch scan to an iterator, and not really ever pass around `Add`s. This code would only generate the selection vector and return the batch as "underlying data + vector". Then the `next` method on `scan` would actually poke at the engine data to get the add file paths and dvs, filter out removed adds, and return data as read by the engine. I've noted that in #123.
I'd say we merge this PR just to get arrow out, and then look at changing the return type of `scan` along the lines of #123 to make this more efficient.
@ryan-johnson-databricks does that make sense to you?
wooooo!
@@ -90,7 +104,7 @@ impl LogReplayScanner {
     // only serve as tombstones for vacuum jobs. So no need to load them here.
     vec![crate::actions::schemas::ADD_FIELD.clone()]
I don't think this works -- the visitor blindly accesses `getters[ADD_FIELD_COUNT]`, which will panic out of bounds if/when we ever have `!is_log_batch`. At a minimum, we need to preserve `remove.path` so the not-null check can skip it, but that would also require us to filter out removes from checkpoint parts at scan level, so that `remove.path` is always null (which also reduces the cost of the scan by not fetching those columns in the first place).
Ahh, great catch. Can't we just propagate `is_log_batch` into the visitor, and not try to look for removes if it's false?
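A sketch of that mitigation (hypothetical, simplified shapes): the visitor records whether the batch came from the commit log and only inspects remove columns when it did, so checkpoint batches never index past the add getters:

```rust
// Hypothetical, simplified AddRemoveVisitor illustrating the short-term fix.
struct AddRemoveVisitor {
    is_log_batch: bool,
    adds: Vec<String>,
    removes: Vec<String>,
}

impl AddRemoveVisitor {
    fn new(is_log_batch: bool) -> Self {
        AddRemoveVisitor { is_log_batch, adds: Vec::new(), removes: Vec::new() }
    }

    // add_path / remove_path stand in for the real getter lookups; checkpoint
    // batches carry no remove columns, so we must not look for them there.
    fn visit_row(&mut self, add_path: Option<String>, remove_path: Option<String>) {
        if let Some(path) = add_path {
            self.adds.push(path);
        } else if self.is_log_batch {
            if let Some(path) = remove_path {
                self.removes.push(path);
            }
        }
    }
}
```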
That could work as a short-term mitigation, yes. Longer term, we want the schema filtering to happen higher up in the stack, so there isn't even a remove column and the `add.path IS NOT NULL` check is pushed down into the scan.
fixed with the short-term mitigation here :)
Co-authored-by: Nick Lanham <[email protected]>
Co-authored-by: Ryan Johnson <[email protected]>
AFAIK, the actual changes from this PR are good and badly needed. All outstanding concerns are pre-existing issues or future work that shouldn't delay merge.
if !self
    .selection_vector
    .as_ref()
    .is_some_and(|selection| !selection[i])
{
    self.adds
        .push(AddVisitor::visit_add(i, path, &getters[..ADD_FIELD_COUNT])?)
Gotta love functional languages, where the "simpler" approach actually produces more lines of code 🤦
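For readability, the double-negated predicate above can be extracted into a helper: keep row `i` unless a selection vector exists and marks it false. A standalone sketch (hypothetical function name):

```rust
// Keep row i unless a selection vector exists and marks the row false.
// A missing (None) selection vector means "keep everything".
fn is_selected(selection_vector: &Option<Vec<bool>>, i: usize) -> bool {
    !selection_vector.as_ref().is_some_and(|selection| !selection[i])
}
```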
.map_ok(|batch| (batch, true));

let parquet_client = engine_interface.get_parquet_handler();
let checkpoint_stream = parquet_client
-    .read_parquet_files(&self.checkpoint_files, read_schema, predicate)?
+    .read_parquet_files(&self.checkpoint_files, checkpoint_read_schema, predicate)?
I guess still TODO to push the not-null predicates down to the scan for columns that survived pruning?
Yea, adding a comment and I'll open an issue.
@@ -90,12 +111,12 @@ impl LogReplayScanner {
     // only serve as tombstones for vacuum jobs. So no need to load them here.
     vec![crate::actions::schemas::ADD_FIELD.clone()]
 });
 let mut visitor = AddRemoveVisitor::default();
checkpoint bugfix: passing `is_log_batch` down to the visitor
// b) recursed into an optional struct that was null. In this case, array.is_none() is
//    true and we don't need to check field nullability, because we assume all fields
//    of a nullable struct can be null
// So below, if the field is allowed to be null OR array.is_none(), we push that;
// otherwise we error out.
if let Some(col) = col {
    Self::extract_column(out_col_array, field, col)?;
-} else if field.is_nullable() {
-    if let DataType::Struct(_) = field.data_type() {
-        Self::extract_columns_from_array(out_col_array, schema, None)?;
+} else if array.is_none() || field.is_nullable() {
+    if let DataType::Struct(inner_struct) = field.data_type() {
+        Self::extract_columns_from_array(out_col_array, inner_struct.as_ref(), None)?;
from nick :)
nice thanks, lgtm
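The nullability rule from that diff can be isolated into a tiny decision function (illustrative only, not the real extraction API): a missing column is acceptable when the field is nullable OR the enclosing array itself is absent, i.e. we recursed into a null optional struct whose fields are all implicitly null:

```rust
// Illustrative: decide whether a missing column value is legal.
// array_is_none = true means we recursed into a null optional struct.
fn may_be_null(field_nullable: bool, array_is_none: bool) -> Result<(), String> {
    if array_is_none || field_nullable {
        Ok(())
    } else {
        Err("found null value for non-nullable field".to_string())
    }
}
```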
Removed the arrow dep from the data skipping code. We were previously using `filter_record_batch`. Now, instead of eagerly performing a filter, the new data skipping code produces a selection vector; `AddRemoveVisitor` consumes this selection vector to apply the filtering as the actions are iterated.

Resolves #126

ADDITIONALLY: fixed a bug in checkpoint reads. We now read the appropriate adds/removes for commits and only adds for checkpoints. Also a bugfix for simpleclient data extraction.
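The flow the description outlines can be sketched as lazy filtering over a selection vector (hypothetical helper, not the kernel API): rather than eagerly rewriting the batch with something like arrow's `filter_record_batch`, the mask is applied as the actions are iterated:

```rust
// Apply a selection vector lazily while iterating actions, instead of
// eagerly materializing a filtered batch.
fn filter_by_selection<T: Clone>(actions: &[T], selection: &[bool]) -> Vec<T> {
    actions
        .iter()
        .zip(selection)
        .filter(|&(_, &keep)| keep)
        .map(|(action, _)| action.clone())
        .collect()
}
```

This keeps the original engine data untouched; the engine decides how (and whether) to materialize the filtered result.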