SorbetColumnDescriptors requires that Schema's be in a particular order #9855


Closed
jleibs opened this issue Apr 30, 2025 · 2 comments · Fixed by #9934

jleibs commented Apr 30, 2025

The try_from_arrow_fields builder depends on a very specific ordering of fields.

See:

match column_kind {
    ColumnKind::RowId => {
        if indices.is_empty() && components.is_empty() {
            row_ids.push(RowIdColumnDescriptor::try_from(field)?);
        } else {
            return Err(SorbetError::custom("RowId column must be the first column"));
        }
    }
    ColumnKind::Index => {
        if components.is_empty() {
            indices.push(IndexColumnDescriptor::try_from(field)?);
        } else {
            return Err(SorbetError::custom(
                "Index columns must come before any data columns",
            ));
        }
    }

While this ordering makes sense from a performance perspective, it violates the "accept whatever users (or our data platform) throw at us" assumption.

As part of:

We introduced a new helper, try_from_arrow_fields_forgiving, but it would be nice if its behavior were the default unless there's a very specific reason for the strict ordering.
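
To make the difference concrete, here is a minimal sketch of strict versus order-insensitive classification. It does not use the real sorbet types: `Columns`, `classify_strict`, and `classify_forgiving` are hypothetical names, and columns are reduced to a `(ColumnKind, &str)` pair. The strict variant copies the checks quoted above; the forgiving variant only illustrates the idea of bucketing by kind and is not claimed to match what try_from_arrow_fields_forgiving actually does.

```rust
// Hypothetical, simplified stand-ins for the real sorbet types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    RowId,
    Index,
    Component,
}

/// Column names bucketed by kind (the real code stores descriptors instead).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Columns {
    pub row_ids: Vec<String>,
    pub indices: Vec<String>,
    pub components: Vec<String>,
}

/// Strict: same ordering checks as the snippet quoted in this issue.
pub fn classify_strict(fields: &[(ColumnKind, &str)]) -> Result<Columns, String> {
    let mut cols = Columns::default();
    for &(kind, name) in fields {
        match kind {
            ColumnKind::RowId => {
                if cols.indices.is_empty() && cols.components.is_empty() {
                    cols.row_ids.push(name.to_owned());
                } else {
                    return Err("RowId column must be the first column".to_owned());
                }
            }
            ColumnKind::Index => {
                if cols.components.is_empty() {
                    cols.indices.push(name.to_owned());
                } else {
                    return Err("Index columns must come before any data columns".to_owned());
                }
            }
            ColumnKind::Component => cols.components.push(name.to_owned()),
        }
    }
    Ok(cols)
}

/// Forgiving: buckets every column by kind, whatever its position.
pub fn classify_forgiving(fields: &[(ColumnKind, &str)]) -> Columns {
    let mut cols = Columns::default();
    for &(kind, name) in fields {
        match kind {
            ColumnKind::RowId => cols.row_ids.push(name.to_owned()),
            ColumnKind::Index => cols.indices.push(name.to_owned()),
            ColumnKind::Component => cols.components.push(name.to_owned()),
        }
    }
    cols
}
```

On input that is already ordered (row IDs, then indices, then components) both functions produce the same buckets; they only differ on shuffled schemas, where the strict one returns an error.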

@jleibs jleibs added 👀 needs triage This issue needs to be triaged by the Rerun team 🪳 bug Something isn't working and removed 👀 needs triage This issue needs to be triaged by the Rerun team labels Apr 30, 2025
@abey79
Copy link
Member

abey79 commented May 1, 2025

The core issue is that SorbetBatch has two overlapping but ultimately distinct goals:

  • General-purpose wrapper over Arrow record batches, with facilities to de/serialize sorbet metadata (used in various places: Python SDK, redap browser, table display, etc.).
  • Stop-gap for the record-batch <-> chunk de/serialisation.

The constraint above (and oddities such as SorbetBatch having an entity_path: Option<EntityPath>) stem from the latter.

A suggestion would be to apply a typestate-like pattern, to exploit the overlap but express the difference between these purposes:

  • SorbetBatch<Flex>: general purpose wrapper
  • SorbetBatch<Strict>: chunk de/serialisation

edit: should be SorbetBatch<Chunk> and SorbetBatch<Dataframe>, as per

//! An arrow record batch that follows a specific schema is called a [`SorbetBatch`].
//!
//! Some [`SorbetBatch`]es has even more constrained requirements, such as [`ChunkBatch`] and `DataframeBatch`.
//! * Every [`ChunkBatch`] is a [`SorbetBatch`].
//! * Every `DataframeBatch` is a [`SorbetBatch`].
//!
//! NOTE: `DataframeBatch` has not yet been implemented.
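
A rough sketch of what the typestate-like idea could look like, using the original Flex/Strict naming. Everything here is hypothetical: the marker types, the `columns: Vec<ColumnKind>` field, `into_strict`, and `data_start` are illustrative names, not the actual re_sorbet API. The point is only that ordering is validated once, at the Flex to Strict conversion, and that order-dependent methods exist solely on the strict type.

```rust
use std::marker::PhantomData;

// Hypothetical marker types for the two purposes.
#[derive(Debug)]
pub struct Flex;
#[derive(Debug)]
pub struct Strict;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    RowId,
    Index,
    Component,
}

/// Simplified stand-in: only the column kinds are tracked here.
#[derive(Debug)]
pub struct SorbetBatch<Mode> {
    columns: Vec<ColumnKind>,
    _mode: PhantomData<Mode>,
}

// Shared functionality, available in both modes.
impl<Mode> SorbetBatch<Mode> {
    pub fn num_columns(&self) -> usize {
        self.columns.len()
    }
}

impl SorbetBatch<Flex> {
    /// Accepts columns in any order.
    pub fn new(columns: Vec<ColumnKind>) -> Self {
        Self { columns, _mode: PhantomData }
    }

    /// Checks the row-id / index / component ordering exactly once.
    pub fn into_strict(self) -> Result<SorbetBatch<Strict>, String> {
        let rank = |kind: &ColumnKind| match kind {
            ColumnKind::RowId => 0,
            ColumnKind::Index => 1,
            ColumnKind::Component => 2,
        };
        if self.columns.windows(2).all(|w| rank(&w[0]) <= rank(&w[1])) {
            Ok(SorbetBatch { columns: self.columns, _mode: PhantomData })
        } else {
            Err("columns must be ordered: row ids, then indices, then components".to_owned())
        }
    }
}

impl SorbetBatch<Strict> {
    /// Position of the first component column; meaningful only because
    /// the strict mode guarantees that components come last.
    pub fn data_start(&self) -> usize {
        self.columns
            .iter()
            .position(|kind| *kind == ColumnKind::Component)
            .unwrap_or(self.columns.len())
    }
}
```

Code that only displays or inspects a batch would take `SorbetBatch<Flex>`, while the chunk de/serialisation path would require `SorbetBatch<Strict>` in its signatures, so the compiler enforces that the ordering check has run.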

abey79 added a commit that referenced this issue May 1, 2025
…arch APIs (#9854)

### Related

* Fixes #9837
* Further issue to address:
  * #9853
  * #9855 

### What

Initial attempt to formalise component column selector, how they are
matched against a schema, and how they are expressed in our Python API.
Applied on dataset index creation/search APIs.

TODO:
- [x] use `AnyComponentColumn` in APIs
- [x] cleanup and fix type stubs

---------

Co-authored-by: Jeremy Leibs <[email protected]>

emilk commented May 2, 2025

After a discussion with @abey79 we decided we "just" need to make SorbetBatch more general, allowing columns in any order, with each column being a

pub enum ColumnDescriptor {
    RowId(RowIdColumnDescriptor),
    Time(IndexColumnDescriptor),
    Component(ComponentColumnDescriptor),
}

and then we move the ordering-constraint of columns into the more strict ChunkBatch.

We won't need the planned DataframeBatch.

SorbetBatch should have a .kind() that can return Dataframe, or Chunk, or None.

There should also be a ChunkBatch::try_from(SorbetBatch) and an impl SorbetBatch { fn split_into_chunks(self) -> Result<Vec<ChunkBatch>> {…} } that splits a multi-entity record batch into chunks, for ingestion into the ChunkStore.
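
One possible shape for this plan, as a descriptor-only sketch. The enum is the one proposed above; everything else is an assumption made for illustration: the descriptor fields, `String` as the error and entity-path type, the rule that a chunk needs at least one component column and exactly one entity, and the choice to copy the row-id and index descriptors into every chunk. The actual change in #9934 may look quite different, and real code would also have to split the Arrow data, not just the descriptors.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

// Hypothetical, heavily simplified descriptors.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowIdColumnDescriptor;
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexColumnDescriptor { pub timeline: String }
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ComponentColumnDescriptor { pub entity_path: String, pub component: String }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnDescriptor {
    RowId(RowIdColumnDescriptor),
    Time(IndexColumnDescriptor),
    Component(ComponentColumnDescriptor),
}

/// General batch: columns in any order, any number of entities.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SorbetBatch { pub columns: Vec<ColumnDescriptor> }

/// Strict batch: ordered columns, a single entity (assumed here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkBatch { pub entity_path: String, pub columns: Vec<ColumnDescriptor> }

impl TryFrom<SorbetBatch> for ChunkBatch {
    type Error = String;

    /// The ordering constraint now lives here instead of in SorbetBatch.
    fn try_from(batch: SorbetBatch) -> Result<Self, String> {
        let mut seen_index = false;
        let mut entity_path: Option<String> = None;
        for col in &batch.columns {
            match col {
                ColumnDescriptor::RowId(_) => {
                    if seen_index || entity_path.is_some() {
                        return Err("RowId column must be the first column".to_owned());
                    }
                }
                ColumnDescriptor::Time(_) => {
                    if entity_path.is_some() {
                        return Err("Index columns must come before any data columns".to_owned());
                    }
                    seen_index = true;
                }
                ColumnDescriptor::Component(c) => {
                    let same = entity_path.as_deref().map_or(true, |p| p == c.entity_path.as_str());
                    if !same {
                        return Err("a chunk must contain a single entity".to_owned());
                    }
                    if entity_path.is_none() {
                        entity_path = Some(c.entity_path.clone());
                    }
                }
            }
        }
        let entity_path = entity_path.ok_or_else(|| "no component columns".to_owned())?;
        Ok(Self { entity_path, columns: batch.columns })
    }
}

impl SorbetBatch {
    /// Groups component columns by entity and emits one ordered chunk per entity.
    pub fn split_into_chunks(self) -> Result<Vec<ChunkBatch>, String> {
        let mut row_ids = Vec::new();
        let mut indices = Vec::new();
        let mut per_entity: BTreeMap<String, Vec<ColumnDescriptor>> = BTreeMap::new();
        for col in self.columns {
            match col {
                ColumnDescriptor::RowId(d) => row_ids.push(ColumnDescriptor::RowId(d)),
                ColumnDescriptor::Time(d) => indices.push(ColumnDescriptor::Time(d)),
                ColumnDescriptor::Component(d) => per_entity
                    .entry(d.entity_path.clone())
                    .or_default()
                    .push(ColumnDescriptor::Component(d)),
            }
        }
        per_entity
            .into_iter()
            .map(|(_, components)| {
                // Re-emit in the order ChunkBatch expects.
                let mut columns = row_ids.clone();
                columns.extend(indices.iter().cloned());
                columns.extend(components);
                ChunkBatch::try_from(SorbetBatch { columns })
            })
            .collect()
    }
}
```

With this split, callers that receive arbitrary record batches work with SorbetBatch directly, and only the ingestion path pays for the reordering and validation.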


@emilk emilk self-assigned this May 8, 2025
emilk added a commit that referenced this issue May 8, 2025
### Related
* Closes #9034

### Unblocks:
* #9855
* #9921
* #9922
@emilk emilk closed this as completed in 54ddc8c May 9, 2025