Investigate Pub/Sub Encoding #453
I've run several experiments, and I can isolate the issue down to how Base64 encoding works on strings and data in Ruby. This means I can illustrate the problem without using gcloud-ruby, in the following example code. When a string or binary data is run through Base64, all string encoding information is lost: a UTF-8 string comes back as an ASCII-8BIT (binary) string.

require "base64"

s1 = '¯\_(ツ)_/¯'
s1 #=> "¯\\_(ツ)_/¯"
s1.encoding #=> #<Encoding:UTF-8>

s2 = Base64.decode64 Base64.encode64 s1
s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

The good news is that data read from a binary file is properly encoded to and decoded from Base64. The bad news is that a string encoded as UTF-8 will have that encoding lost, as shown above, so any characters outside of 7-bit ASCII will come back improperly tagged. The solution in Ruby-land is to force the encoding, like so:

s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

# Force the encoding of the returned data
s2.force_encoding "UTF-8"
s2 #=> "¯\\_(ツ)_/¯"
s2.encoding #=> #<Encoding:UTF-8>

gcloud-node had a problem because it always forced the data to a UTF-8 string, and lost data because of it. gcloud-ruby has a different problem: it loses the encoding of the string, requiring users to force the encoding after the data is returned. I think we have the following options:
Is this issue about binary support (like gcloud-node #987, "Binary messages are unsupported by the js pubsub API"), or is it about preserving string encoding? Maybe we need two issues? It's just kind of hard for me to follow.
The gcloud-node issue is saying that messages with binary data pulled by gcloud-node will corrupt or lose parts of the data, because gcloud-node was always forcing the UTF-8 encoding. gcloud-ruby does not have that problem.
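To illustrate that failure mode in Ruby terms (this is not gcloud-node's actual code, just a standard-library sketch of what treating arbitrary bytes as UTF-8 does): once invalid sequences are cleaned up, the original bytes cannot be recovered.

require "base64"

# Arbitrary binary payload, e.g. the first bytes of a JPEG file.
binary = "\xFF\xD8\xFF\xE0\x00\x10JFIF".b
binary.valid_encoding? #=> true (ASCII-8BIT treats any byte sequence as valid)

# Pretend the bytes are UTF-8, as gcloud-node effectively did:
utf8 = binary.dup.force_encoding("UTF-8")
utf8.valid_encoding? #=> false

# Cleaning up the invalid sequences silently replaces bytes -- data is lost.
scrubbed = utf8.scrub
scrubbed.bytes == binary.bytes #=> false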
Great! Now that I understand that there is no problem in gcloud-ruby with binary values, my preference for the string encoding issue is 1), with a doc example showing the call to force_encoding.
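A minimal sketch (standard library only) of what such a doc example could say, with `data` standing in for the string handed back from a pulled Pub/Sub message; the valid_encoding? guard is an extra suggestion, not existing gcloud-ruby behavior.

# `data` stands in for the String returned from a pulled message;
# per this thread, gcloud-ruby returns it tagged as ASCII-8BIT (binary).
data = "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF".b

# If the publisher sent UTF-8 text, restore the encoding tag before using it:
text = data.force_encoding "UTF-8"
raise "message data is not valid UTF-8" unless text.valid_encoding?

text          #=> "¯\\_(ツ)_/¯"
text.encoding #=> #<Encoding:UTF-8>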
The issue as I see it is that we may be violating some users' expectations by not automatically converting message data to the proper encoding. The problem is that Pub/Sub has no mechanism to specify how the message data is expected to be used. It could be a binary stream of bytes, a UTF-8 encoded string, or an ISO-2022-JP encoded string. But since we know the encoding in Ruby when publishing the message data, we do have the opportunity to automatically add that attribute to the message.

The counter-argument is about being cautious and not making too many assumptions. What if someone publishes a message with binary data and wants to use the "encoding" attribute for their own processing? Or the worker pulling the messages is not written in Ruby and uses another library with a different set of assumptions? One would think that those pulling the messages know what type of data was published and how to deal with it. If we do 2) or 3), we could cause problems in these scenarios. Something to keep in mind.
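For what option 2) could look like mechanically, here is a rough sketch with a plain Hash standing in for the Pub/Sub message; the "encoding" attribute name and both helpers are made up for illustration, not proposed API.

# Publishing side: stamp the string's encoding into a message attribute.
def build_message(data)
  {
    "data"       => data.dup.force_encoding("ASCII-8BIT"), # raw bytes on the wire
    "attributes" => { "encoding" => data.encoding.name }   # e.g. "UTF-8"
  }
end

# Pulling side: restore the encoding if the attribute is present.
def decode_message(message)
  data = message["data"]
  name = message.dig("attributes", "encoding")
  name ? data.force_encoding(name) : data
end

message = build_message '¯\_(ツ)_/¯'
decode_message(message).encoding #=> #<Encoding:UTF-8>

The counter-arguments in the comment above still apply: the attribute name becomes part of the message contract, and non-Ruby subscribers would have to honor it.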
Very helpful analysis, thanks!
Since we're not currently doing anything harmful, and this may someday be addressed by Pub/Sub itself, maybe leave our behavior as is, and just add some doc discussion for now?
I have a very slight preference for 1). If we stay with the existing behavior then I agree we should address how to fix the encoding of a message's data in the documentation. What I like about 2) is that we could delight users by removing all the encoding friction for them. I'm hoping someone pushes me over to this option...
I also vote for 1), with discussion added in the docs.
But option 2) is so shiny and nice...
Not ruling it out! If anyone reading this in the future wishes we had implemented 2), please let us know.
Having now dealt with encoding issues while supporting Datastore's blob values, I am squarely in option 1's camp. Also, #605 pretty much cements this option. Also also, I think we could extend our support for IO-like objects to Pub/Sub, so that users could set a message's data to be a File. This would be handy for stuff like passing images in to get resized.
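On the IO-like idea, a rough sketch of what the publishing side could do; message_data_from is a hypothetical helper, and the commented-out publish call is only a placeholder, not real gcloud-ruby API.

# Hypothetical helper: accept a String or anything IO-like as message data.
def message_data_from(source)
  if source.respond_to?(:read)
    data = source.read
    data.force_encoding("ASCII-8BIT") # treat IO contents as raw bytes
  else
    String(source)
  end
end

# Usage: publish the bytes of an image file.
File.open("photo.jpg", "rb") do |file|
  data = message_data_from(file)
  data.encoding #=> #<Encoding:ASCII-8BIT>
  # topic.publish data   # placeholder for the actual publish call
end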
Option 1) or 2) seems good to me. Our system involves two encodings. I use gcloud-ruby for publishing messages, but for pulling messages I use another package. If option 2) were common to Pub/Sub clients, I think option 2) would be good.
Option 2) would be specific to gcloud-ruby. AFAIK there is no standard among the various Pub/Sub clients for specifying or automatically formatting the message data bytes.
This is not going to happen. Multibyte strings will be passed as their bytes and returned as an ASCII-8BIT string. This is consistent with how Google Cloud APIs deal with binary data.
Does gcloud-ruby need an encoding option for encoding to/from base64? Or is the string's built-in encoding sufficient?
See googleapis/google-cloud-node#988