SorbetColumnDescriptors requires that Schema's be in a particular order #9855

jleibs · 2025-04-30T18:09:24Z

The try_from_arrow_fields builder depends on a very specific ordering of fields.

See:

rerun/crates/store/re_sorbet/src/sorbet_columns.rs

Lines 139 to 156 in 9b8dbe6

    
    match column_kind {
        ColumnKind::RowId => {
            if indices.is_empty() && components.is_empty() {
                row_ids.push(RowIdColumnDescriptor::try_from(field)?);
            } else {
                return Err(SorbetError::custom("RowId column must be the first column"));
            }
        }
        ColumnKind::Index => {
            if components.is_empty() {
                indices.push(IndexColumnDescriptor::try_from(field)?);
            } else {
                return Err(SorbetError::custom(
                    "Index columns must come before any data columns",
                ));
            }
        }

While this ordering makes sense from a performance perspective, it violates the "accept what users (or our dataplatform) throw at us" assumption.

As part of:

Properly resolve component selectors in dataset index creation and search APIs #9854

We introduced a new helper, try_from_arrow_fields_forgiving, but it would be nice if its behavior were the default unless there's a very specific reason for the current strictness.
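To make the difference concrete, here is a minimal sketch of order-insensitive grouping. It is not the actual try_from_arrow_fields_forgiving implementation: `Field`, `GroupedColumns` and `group_fields_forgiving` are hypothetical stand-ins, and the column kind is assumed to be already resolved for each field.

```rust
/// Simplified stand-in for the kinds of column a sorbet schema can contain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    RowId,
    Index,
    Component,
}

/// Hypothetical, simplified field: a name plus its already-resolved kind.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub kind: ColumnKind,
}

/// Columns grouped by kind, independent of where they appeared in the schema.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct GroupedColumns {
    pub row_ids: Vec<String>,
    pub indices: Vec<String>,
    pub components: Vec<String>,
}

/// Order-insensitive grouping: every field is accepted wherever it appears.
/// The relative order of fields within each kind is preserved.
pub fn group_fields_forgiving(fields: &[Field]) -> GroupedColumns {
    let mut grouped = GroupedColumns::default();
    for field in fields {
        // Pick the bucket by kind only; position in the schema is ignored.
        let bucket = match field.kind {
            ColumnKind::RowId => &mut grouped.row_ids,
            ColumnKind::Index => &mut grouped.indices,
            ColumnKind::Component => &mut grouped.components,
        };
        bucket.push(field.name.clone());
    }
    grouped
}
```

With this shape, a schema listing a component column before the row id column is grouped just like a strictly ordered one, instead of being rejected.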


abey79 · 2025-05-01T07:34:11Z

The core issue is that SorbetBatch has two overlapping but ultimately distinct goals:

General-purpose wrapper over Arrow record batches, with facilities to de/serialize sorbet metadata (used in various places: Python SDK, redap browser, table display, etc.).
Stop-gap for the record-batch <-> chunk de/serialisation.

The constraint above (and oddities such as SorbetBatch having an entity_path: Option<EntityPath>) stem from the latter.

A suggestion would be to apply a typestate-like pattern, to exploit the overlap but express the difference between these purposes:

SorbetBatch<Flex>: general purpose wrapper
SorbetBatch<Strict>: chunk de/serialisation
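A rough sketch of what the typestate suggestion could look like. Everything here is an assumption made for illustration: the batch is reduced to a list of column kinds, and `into_strict`, `kinds` and `first_component_index` are hypothetical names, not existing re_sorbet APIs.

```rust
use std::marker::PhantomData;

/// Declaration order doubles as the strict column order:
/// row id, then index, then component columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ColumnKind {
    RowId,
    Index,
    Component,
}

/// Marker: general-purpose wrapper, columns may appear in any order.
#[derive(Debug)]
pub struct Flex;

/// Marker: chunk de/serialisation, column order has been validated.
#[derive(Debug)]
pub struct Strict;

/// Hypothetical, simplified batch that only tracks the kind of each column.
#[derive(Debug)]
pub struct SorbetBatch<Mode> {
    kinds: Vec<ColumnKind>,
    _mode: PhantomData<Mode>,
}

impl SorbetBatch<Flex> {
    /// Any column order is accepted here.
    pub fn new(kinds: Vec<ColumnKind>) -> Self {
        Self { kinds, _mode: PhantomData }
    }

    /// The ordering constraint is checked exactly once, at the type boundary.
    pub fn into_strict(self) -> Result<SorbetBatch<Strict>, String> {
        let ordered = self.kinds.windows(2).all(|pair| pair[0] <= pair[1]);
        if ordered {
            Ok(SorbetBatch { kinds: self.kinds, _mode: PhantomData })
        } else {
            Err("columns must be ordered: row id, index, component".to_owned())
        }
    }
}

impl<Mode> SorbetBatch<Mode> {
    /// Shared functionality is available in both modes.
    pub fn kinds(&self) -> &[ColumnKind] {
        &self.kinds
    }
}

impl SorbetBatch<Strict> {
    /// Strict-only helper: relies on components forming the tail of the schema.
    pub fn first_component_index(&self) -> usize {
        self.kinds
            .iter()
            .take_while(|kind| **kind != ColumnKind::Component)
            .count()
    }
}
```

The overlap lives in the `impl<Mode>` block, while anything that depends on the ordering is only callable on the strict variant, so the compiler enforces the distinction between the two purposes.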

edit: should be SorbetBatch<Chunk> and SorbetBatch<Dataframe>, as per

rerun/crates/store/re_sorbet/src/lib.rs

Lines 5 to 11 in 1a48a46

    
    //! An arrow record batch that follows a specific schema is called a [`SorbetBatch`].
    //!
    //! Some [`SorbetBatch`]es has even more constrained requirements, such as [`ChunkBatch`] and `DataframeBatch`.
    //! * Every [`ChunkBatch`] is a [`SorbetBatch`].
    //! * Every `DataframeBatch` is a [`SorbetBatch`].
    //!
    //! NOTE: `DataframeBatch` has not yet been implemented.

…arch APIs (#9854)

### Related
* Fixes #9837
* Further issue to address:
  * #9853
  * #9855

### What
Initial attempt to formalise component column selector, how they are matched against a schema, and how they are expressed in our Python API. Applied on dataset index creation/search APIs.

TODO:
- [x] use `AnyComponentColumn` in APIs
- [x] cleanup and fix type stubs

Co-authored-by: Jeremy Leibs

emilk · 2025-05-02T08:24:00Z

After a discussion with @abey79 we decided we "just" need to make SorbetBatch more general, allowing columns in any order, with each column being a

pub enum ColumnDescriptor {
    RowId(RowIdColumnDescriptor),
    Time(IndexColumnDescriptor),
    Component(ComponentColumnDescriptor),
}

and then we move the ordering-constraint of columns into the more strict ChunkBatch.

We won't need the planned DataframeBatch.

SorbetBatch should have a .kind() that can return Dataframe, or Chunk, or None.

There should also be a ChunkBatch::try_from(SorbetBatch) and an impl SorbetBatch { fn split_into_chunks(self) -> Result<Vec<ChunkBatch>> {…} } that splits a multi-entity recordbatch into chunks, for ingestion into the ChunkStore.
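For discussion, the plan above could be sketched roughly as follows. This is a simplified model under stated assumptions, not the re_sorbet implementation: descriptor payloads are reduced to strings, errors are plain `String`s, a chunk is assumed to hold exactly one entity with at least one component column, and only column descriptors (no Arrow data) are carried around.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

/// Simplified descriptors; the payloads here are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnDescriptor {
    RowId,
    Time(String),
    Component { entity_path: String, component: String },
}

impl ColumnDescriptor {
    /// Rank in the strict chunk order: row id, then times, then components.
    fn rank(&self) -> u8 {
        match self {
            Self::RowId => 0,
            Self::Time(_) => 1,
            Self::Component { .. } => 2,
        }
    }
}

/// General batch: columns in any order, possibly spanning several entities.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SorbetBatch {
    pub columns: Vec<ColumnDescriptor>,
}

/// Strict batch: a single entity, with the ordering constraint enforced.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkBatch {
    pub entity_path: String,
    pub columns: Vec<ColumnDescriptor>,
}

impl TryFrom<SorbetBatch> for ChunkBatch {
    type Error = String;

    fn try_from(batch: SorbetBatch) -> Result<Self, Self::Error> {
        if batch.columns.first() != Some(&ColumnDescriptor::RowId) {
            return Err("RowId column must be the first column".to_owned());
        }
        if batch.columns.windows(2).any(|pair| pair[0].rank() > pair[1].rank()) {
            return Err("columns must be ordered: row id, index, component".to_owned());
        }
        let mut entities = batch.columns.iter().filter_map(|column| match column {
            ColumnDescriptor::Component { entity_path, .. } => Some(entity_path.clone()),
            _ => None,
        });
        let entity_path = entities.next().ok_or("chunk has no component columns")?;
        if entities.any(|other| other != entity_path) {
            return Err("chunk must contain a single entity".to_owned());
        }
        Ok(Self { entity_path, columns: batch.columns })
    }
}

impl SorbetBatch {
    /// Splits a multi-entity batch into one strictly ordered chunk per entity.
    pub fn split_into_chunks(self) -> Result<Vec<ChunkBatch>, String> {
        let mut has_row_id = false;
        let mut times = Vec::new();
        let mut per_entity: BTreeMap<String, Vec<ColumnDescriptor>> = BTreeMap::new();
        for column in self.columns {
            match column {
                ColumnDescriptor::RowId => has_row_id = true,
                time @ ColumnDescriptor::Time(_) => times.push(time),
                ColumnDescriptor::Component { entity_path, component } => per_entity
                    .entry(entity_path.clone())
                    .or_default()
                    .push(ColumnDescriptor::Component { entity_path, component }),
            }
        }
        if !has_row_id {
            return Err("missing RowId column".to_owned());
        }
        // Every chunk gets the row id and all time columns, then its own components.
        Ok(per_entity
            .into_iter()
            .map(|(entity_path, components)| {
                let mut columns = vec![ColumnDescriptor::RowId];
                columns.extend(times.iter().cloned());
                columns.extend(components);
                ChunkBatch { entity_path, columns }
            })
            .collect())
    }
}
```

In this model the general batch never rejects a column order; the ordering check only runs when converting to a ChunkBatch, and split_into_chunks produces batches that already satisfy it.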
