
Add optional handling for binary arrays in json #4967

Closed

Conversation

mightyshazam

Which issue does this PR close?

Closes #4945.

Rationale for this change

There is no native binary representation in JSON, so arrow-json simply returns an error when creating a decoder for a schema with binary columns. The suggestion in the issue was to handle it manually, but there was no way to opt in to this behavior in the library. This change allows a consumer to opt in to binary serialization using their desired conversion method, or keep the default behavior.

What changes are included in this PR?

A user can implement the new BinaryArrayDecoder trait to control how JSON values are converted to binary, and register that implementation on the ReaderBuilder to opt in to binary decoding.

Are there any user-facing changes?

The addition of pub fn with_binary_decoding(self, binary_decoder: BinaryDecoderProvider) to the ReaderBuilder implementation in arrow-json.
New BinaryArrayDecoder trait in arrow-json.

github-actions bot added the arrow label (Changes to the arrow crate) on Oct 20, 2023
@tustvold
Contributor

This might be a stupid question, but why does this need to be implemented in the JSON reader, as opposed to processing the StringArray after the fact?

@mightyshazam
Author

> This might be a stupid question, but why does this need to be implemented in the JSON reader, as opposed to processing the StringArray after the fact?

This could all be the product of me not understanding my options. I started looking into this as part of work on the delta-rs project. I was working on a Kubernetes operator to do some maintenance when I encountered the ArrowJson error related to binary columns; it turns out we can't run checkpoints on tables with binary columns.
This was due to the call to build_decoder in the ReaderBuilder. After a little research, I found the issue and your comments about handling it manually. Instead of creating a new reader for delta-rs, I thought it might be more reasonable to implement this as optional functionality in the arrow library.
I know it is probably a pretty niche case, but it was also a relatively simple way forward. If you can think of a better alternative, I am definitely open to that.
Hopefully, that context makes sense.

@tustvold
Contributor

tustvold commented Oct 20, 2023

Could you provide an example of what the contents of this binary field are encoded as? Assuming the data is valid JSON, the values must be valid UTF-8, and can therefore be read as a regular StringArray, i.e. DataType::Utf8.

Now the question is what the downstream expects. E.g. if the data in the JSON file is actually base64-encoded, do they expect a BinaryArray with this data decoded? Or is the fact that the data is labelled binary just a schema error, and the data is actually UTF-8 whose charset was lost somewhere down the line? This will determine what post-processing on the output is necessary, if any.
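For illustration only (not part of this PR): if the values really are base64, the post-processing described here can be a small standalone step after the JSON read. The helper name decode_base64_column and the use of the base64 crate below are assumptions made for this sketch, not arrow-json API.

use arrow_array::{builder::BinaryBuilder, BinaryArray, StringArray};
use base64::{engine::general_purpose::STANDARD, Engine as _};

// Post-process a column that the JSON reader produced as DataType::Utf8,
// assuming the strings hold base64-encoded bytes.
fn decode_base64_column(strings: &StringArray) -> Result<BinaryArray, base64::DecodeError> {
    let mut builder = BinaryBuilder::new();
    for value in strings.iter() {
        match value {
            Some(s) => builder.append_value(STANDARD.decode(s)?),
            None => builder.append_null(),
        }
    }
    Ok(builder.finish())
}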

@mightyshazam
Author

The example is a log file for a Delta table. The log entries are JSON, and they contain a stats field that lists the min/max for each column in the schema. delta-rs writes the contents of binary columns as a serialized string using the method below:

// `v` is the Vec<u8> value of the binary column.
// escape_default maps each byte to its printable ASCII escape
// (e.g. b'\n' -> "\n", 0x9c -> "\x9c"), so the output is always valid UTF-8.
let escaped_bytes = v
    .into_iter()
    .flat_map(std::ascii::escape_default)
    .collect::<Vec<u8>>();
// The escaped bytes are pure ASCII, so this conversion cannot fail.
let escaped_string = String::from_utf8(escaped_bytes).unwrap();
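For a concrete illustration (mine, not from the PR): escape_default turns the bytes [0x00, 0x61, 0xff] into the nine-character ASCII string \x00a\xff, which is valid UTF-8 and survives a round trip through JSON, but is no longer the original byte sequence.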

When creating a checkpoint, it takes the accumulated JSON from the various log files, which looks something like this (the content field is the binary field in this case):

{"add":{"path":"part-00001-8a7f1b6b-5869-4c37-8760-462ee9c97d49-c000.parquet","size":1522,"partitionValues":{},"modificationTime":1697658217285,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"maxValues\":{\"tenant_model_id\":\"00000004\",\"content\":\"00000004\",\"job_id\":\"00000004\",\"platform_model_id\":\"00000004\"},\"nullCount\":{\"job_id\":0,\"content\":0,\"tenant_model_id\":0,\"platform_model_id\":0}}"}}

The checkpoint code takes a series of JSON entries like the one above and uses the ReaderBuilder's serialize method to turn them into a RecordBatch, since the checkpoints are in parquet format.
The issue isn't that the schema is incorrect, but the usage may be. It might make sense to convert the delta schema to an arrow schema with utf-8 columns, because that is what is ultimately written. However, the code as is just converts directly from a delta schema to an arrow schema and builds the decoder. Due to the lack of binary column support, the method returns an error instead of a decoder.
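As an aside, here is a minimal sketch of the failing path described above (my own reproduction with a made-up single-column schema, assuming the arrow-json behavior at the time of this discussion):

use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Direct mapping of a Delta schema with a binary column to Arrow.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "content",
        DataType::Binary,
        true,
    )]));

    // With no binary support in arrow-json, this returns an error instead
    // of a Decoder, which is what blocks the checkpoint code.
    match ReaderBuilder::new(schema).build_decoder() {
        Ok(_) => println!("decoder built"),
        Err(e) => println!("build_decoder failed: {e}"),
    }
}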

@scovich
Contributor

scovich commented Feb 13, 2025

IMO that older PR does NOT solve the problem mentioned here. That PR gives a way to cast from binary to string (that happens to contain base64-encoded binary data). But it's still casting, which means somebody has to manipulate schemas on both the read and write path. Arrow JSON still doesn't have any actual support for encoding and decoding the arrow binary type.

That said, there's definitely an argument to be made that round-trip encoding needs domain knowledge (i.e. does the sender and/or receiver expect hex? base64? ascii85? which variant? etc.). So maybe manual casts are the best arrow itself can expose, to avoid getting sucked into that rabbit hole?
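As a toy illustration of that point (not from the thread): the same four bytes admit several equally reasonable string encodings, and nothing in the schema records which one a producer used.

use base64::{engine::general_purpose::STANDARD, Engine as _};

fn main() {
    let bytes = [0xde_u8, 0xad, 0xbe, 0xef];
    // Hex and base64 renderings of the same bytes look nothing alike.
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    let b64 = STANDARD.encode(bytes);
    println!("hex: {hex}");    // hex: deadbeef
    println!("base64: {b64}"); // base64: 3q2+7w==
}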
