Add optional handling for binary arrays in json #4967
Conversation
This might be a stupid question, but why does this need to be implemented in the JSON reader, as opposed to processing the StringArray after the fact?
This could all be the product of me not understanding my options. I started looking into this as part of work on the delta-rs project. I was working on a Kubernetes operator to do some maintenance when I encountered the ArrowJson error related to binary columns. It turns out we can't run checkpoints on tables with binary columns.
Could you provide an example of what the contents of this binary file are encoded as? Assuming the data is valid JSON, it must be valid UTF-8, and can therefore be read as a regular StringArray, i.e. DataType::Utf8. The question is then what the downstream expects: if the data in the JSON file is actually base64-encoded, does it expect a BinaryArray with this data decoded? Or is the fact the data is labelled binary just a schema error, and the data is actually UTF-8 whose charset was lost somewhere down the line? This determines what post-processing on the output is necessary, if any.
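As a concrete illustration of the first case, here is a minimal post-read sketch, assuming the column was read as `DataType::Utf8` containing base64 text. It uses the `arrow_array` and `base64` crates; the function name is mine, not part of either library.

```rust
use arrow_array::{Array, BinaryArray, StringArray};
use base64::{engine::general_purpose::STANDARD, Engine as _};

/// Decode a column that was read as base64 text into a BinaryArray.
fn base64_strings_to_binary(strings: &StringArray) -> Result<BinaryArray, base64::DecodeError> {
    let mut decoded: Vec<Option<Vec<u8>>> = Vec::with_capacity(strings.len());
    for i in 0..strings.len() {
        if strings.is_null(i) {
            decoded.push(None); // preserve nulls
        } else {
            decoded.push(Some(STANDARD.decode(strings.value(i))?));
        }
    }
    Ok(decoded.into_iter().collect())
}
```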
The example is a log file for a Delta table. The log entries are in JSON, and they contain a stats field that lists the min/max for each column in the schema. delta-rs writes the contents of binary columns as a serialized string.
When creating a checkpoint, it takes the accumulated JSON from the various log files, which looks something like this:
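(An illustrative stand-in, since the original excerpt is not reproduced here; the shape follows the Delta log format, but the path and values are made up.)

```json
{
  "add": {
    "path": "part-00000.snappy.parquet",
    "stats": "{\"numRecords\":4,\"minValues\":{\"bin_col\":\"AQID\"},\"maxValues\":{\"bin_col\":\"BAUG\"}}"
  }
}
```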
IMO that older PR does NOT solve the problem mentioned here. That PR gives a way to cast from binary to string (one that happens to contain base64-encoded binary data). But it's still casting, which means somebody has to manipulate schemas on both the read and write paths. Arrow JSON still doesn't have any actual support for encoding and decoding the arrow binary type. That said, there's definitely an argument to be made that round-trip encoding needs domain knowledge (i.e. does the sender and/or receiver expect hex? base64? ascii85? which variant? etc.). So maybe manual casts are the best arrow itself can expose, to avoid getting sucked into that rabbit hole?
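For the write path, such a manual cast could look like the following sketch (function name mine), which base64-encodes a `BinaryArray` into a `StringArray` before handing it to the JSON writer; picking base64 over hex or ascii85 is exactly the domain knowledge in question.

```rust
use arrow_array::{Array, BinaryArray, StringArray};
use base64::{engine::general_purpose::STANDARD, Engine as _};

/// Encode a BinaryArray as base64 text so arrow-json can serialize it.
fn binary_to_base64_strings(binary: &BinaryArray) -> StringArray {
    (0..binary.len())
        .map(|i| (!binary.is_null(i)).then(|| STANDARD.encode(binary.value(i))))
        .collect()
}
```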
Which issue does this PR close?
Closes #4945.
Rationale for this change
There is no native binary representation in JSON, so `arrow-json` simply returns an error when creating a decoder for a binary column. The suggestion in the issue was to handle it manually, but there was no way to opt in to this behavior in the library. This change allows a consumer to opt in to binary serialization using their desired conversion method, or to keep the default behavior.
What changes are included in this PR?
A user can implement the `BinaryArrayDecoder` trait to supply their own conversion between JSON values and binary data.
Are there any user-facing changes?
The addition of `pub fn with_binary_decoding(self, binary_decoder: BinaryDecoderProvider)` to the `ReaderBuilder` implementation in `arrow-json`.
New `BinaryArrayDecoder` trait in `arrow-json`.
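To make the opt-in concrete, here is a self-contained sketch. The PR's exact signatures for `BinaryArrayDecoder` and `BinaryDecoderProvider` are not shown in this thread, so the trait below is a hypothetical stand-in rather than the real definition.

```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};

// Hypothetical stand-in for the PR's trait; the real definition lives in
// arrow-json and may differ. The idea: turn one JSON string value into bytes.
trait BinaryArrayDecoder {
    fn decode(&self, value: &str) -> Result<Vec<u8>, String>;
}

/// A decoder that treats JSON string values as base64-encoded binary.
struct Base64Decoder;

impl BinaryArrayDecoder for Base64Decoder {
    fn decode(&self, value: &str) -> Result<Vec<u8>, String> {
        STANDARD.decode(value).map_err(|e| e.to_string())
    }
}

// With the real API, this would be wired in via the new builder method,
// along the lines of (signatures assumed):
//
//     let reader = ReaderBuilder::new(schema)
//         .with_binary_decoding(provider) // provider supplies the decoder
//         .build(file)?;
```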