
Add optional handling for binary arrays in json #4967

Closed

Conversation

mightyshazam

Which issue does this PR close?

Closes #4945.

Rationale for this change

There is no native binary representation in JSON, so arrow-json simply returns an error when creating a decoder for a schema with binary columns. The suggestion in the issue was to handle it manually, but there was no way to opt in to this behavior in the library. This change allows a consumer to opt in to binary serialization using their desired conversion method, or keep the default behavior.

What changes are included in this PR?

A user can implement the new BinaryArrayDecoder trait to control how JSON values are converted to binary, and register that implementation on the ReaderBuilder to opt in to binary decoding.

Are there any user-facing changes?

The addition of pub fn with_binary_decoding(self, binary_decoder: BinaryDecoderProvider) to the ReaderBuilder implementation in arrow-json.
New BinaryArrayDecoder trait in arrow-json.

github-actions bot added the arrow label (Changes to the arrow crate) on Oct 20, 2023
@tustvold
Contributor

This might be a stupid question, but why does this need to be implemented in the JSON reader, as opposed to processing the StringArray after the fact?

@mightyshazam
Author

> This might be a stupid question, but why does this need to be implemented in the JSON reader, as opposed to processing the StringArray after the fact?

This could all be the product of me not understanding my options. I started looking into this as part of work on the delta-rs project. I was working on a Kubernetes operator to do some maintenance when I encountered the ArrowJson error related to binary columns; it turns out we can't run checkpoints on tables with binary columns.
This was due to the call to build_decoder in the ReaderBuilder. After a little research, I found the issue and your comments about handling it manually. Instead of creating a new reader for delta-rs, I thought it might be more reasonable to implement this as optional functionality in the arrow library.
I know it is probably a pretty niche case, but it was also a relatively simple way forward. If you can think of a better alternative, I am definitely open to that.
Hopefully, that context makes sense.

@tustvold
Contributor

tustvold commented Oct 20, 2023

Could you provide an example of what the contents of this binary field are encoded as? Assuming the data is valid JSON, the values must be valid UTF-8, and can therefore be read as a regular StringArray, i.e. DataType::Utf8.

Now the question is what the downstream expects. E.g. if the data in the JSON file is actually base64-encoded, do they expect a BinaryArray with this data decoded? Or is the fact that the data is labelled binary just a schema error, and the data is actually UTF-8 whose charset was lost somewhere down the line? This will determine what post-processing on the output is necessary, if any.
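For illustration only (not part of this PR): if the values really are base64, the post-processing described here can be a small standalone step after the JSON read. The helper name decode_base64_column and the use of the base64 crate below are assumptions made for this sketch, not arrow-json API.

use arrow_array::{builder::BinaryBuilder, BinaryArray, StringArray};
use base64::{engine::general_purpose::STANDARD, Engine as _};

// Post-process a column that the JSON reader produced as DataType::Utf8,
// assuming the strings hold base64-encoded bytes.
fn decode_base64_column(strings: &StringArray) -> Result<BinaryArray, base64::DecodeError> {
    let mut builder = BinaryBuilder::new();
    for value in strings.iter() {
        match value {
            Some(s) => builder.append_value(STANDARD.decode(s)?),
            None => builder.append_null(),
        }
    }
    Ok(builder.finish())
}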

@mightyshazam
Author

The example is a log file for a Delta table. The log entries are JSON, and they contain a stats field that lists the min/max for each column in the schema. delta-rs writes the contents of binary columns as a serialized string using the method below:

// `v` is the Vec<u8> value of the binary column.
// escape_default maps each byte to its printable ASCII escape
// (e.g. b'\n' -> "\n", 0x9c -> "\x9c"), so the output is always valid UTF-8.
let escaped_bytes = v
    .into_iter()
    .flat_map(std::ascii::escape_default)
    .collect::<Vec<u8>>();
// The escaped bytes are pure ASCII, so this conversion cannot fail.
let escaped_string = String::from_utf8(escaped_bytes).unwrap();
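For a concrete illustration (mine, not from the PR): escape_default turns the bytes [0x00, 0x61, 0xff] into the nine-character ASCII string \x00a\xff, which is valid UTF-8 and survives a round trip through JSON, but is no longer the original byte sequence.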

When creating a checkpoint, it takes the accumulated JSON from the various log files, which looks something like this (the content field is the binary field in this case):

{"add":{"path":"part-00001-8a7f1b6b-5869-4c37-8760-462ee9c97d49-c000.parquet","size":1522,"partitionValues":{},"modificationTime":1697658217285,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"maxValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"nullCount\":{\"job_id\":0,\"content\":0,\"tenant_model_id\":0,\"platform_model_id\":0}}"}}

The checkpoint code takes a series of JSON entries like the one above and uses the ReaderBuilder's serialize method to turn them into a RecordBatch, since the checkpoints are in parquet format.
The issue isn't that the schema is incorrect, but the usage may be. It might make sense to convert the delta schema to an arrow schema with utf-8 columns, because that is what is ultimately written. However, the code as is just converts directly from a delta schema to an arrow schema and builds the decoder. Due to the lack of binary column support, the method returns an error instead of a decoder.
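As an aside, here is a minimal sketch of the failing path described above (my own reproduction with a made-up single-column schema, assuming the arrow-json behavior at the time of this discussion):

use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Direct mapping of a Delta schema with a binary column to Arrow.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "content",
        DataType::Binary,
        true,
    )]));

    // With no binary support in arrow-json, this returns an error instead
    // of a Decoder, which is what blocks the checkpoint code.
    match ReaderBuilder::new(schema).build_decoder() {
        Ok(_) => println!("decoder built"),
        Err(e) => println!("build_decoder failed: {e}"),
    }
}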

@scovich
Contributor

scovich commented Feb 13, 2025

IMO that older PR does NOT solve the problem mentioned here. That PR gives a way to cast from binary to string (that happens to contain base64-encoded binary data). But it's still casting, which means somebody has to manipulate schemas on both the read and write path. Arrow JSON still doesn't have any actual support for encoding and decoding the arrow binary type.

That said, there's definitely an argument to be made that round-trip encoding needs domain knowledge (i.e. does the sender and/or receiver expect hex? base64? ascii85? which variant? etc.). So maybe manual casts are the best arrow itself can expose, to avoid getting sucked into that rabbit hole?
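As a toy illustration of that point (not from the thread): the same four bytes admit several equally reasonable string encodings, and nothing in the schema records which one a producer used.

use base64::{engine::general_purpose::STANDARD, Engine as _};

fn main() {
    let bytes = [0xde_u8, 0xad, 0xbe, 0xef];
    // Hex and base64 renderings of the same bytes look nothing alike.
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    let b64 = STANDARD.encode(bytes);
    println!("hex: {hex}");    // hex: deadbeef
    println!("base64: {b64}"); // base64: 3q2+7w==
}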
