Skip to content

Investigate Pub/Sub Encoding #453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
blowmage opened this issue Dec 4, 2015 · 14 comments
Closed

Investigate Pub/Sub Encoding #453

blowmage opened this issue Dec 4, 2015 · 14 comments
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue.

Comments

@blowmage
Copy link
Contributor

blowmage commented Dec 4, 2015

Does gcloud-ruby need an encoding option for encoding to/from base64? Or is the string's built-in encoding sufficient?

See googleapis/google-cloud-node#988

@blowmage blowmage added type: question Request for information or clarification. Not an issue. api: pubsub Issues related to the Pub/Sub API. labels Dec 4, 2015
@blowmage blowmage self-assigned this Dec 4, 2015
@blowmage
Copy link
Contributor Author

blowmage commented Dec 4, 2015

I've run several experiments, and I can isolate the issue down to how base64 works on strings and data. This means I can illustrate the problem without using gcloud-ruby in the following example code. When a string or binary data is encoded as base64, all string encoding is lost. This means that a UTF-8 string will be converted to an ASCII 8 bit (binary) string.

require "base64"

s1 = '¯\_(ツ)_/¯'
s1 #=> "¯\\_(ツ)_/¯"
s1.encoding #=> #<Encoding:UTF-8>

s2 = Base64.decode64 Base64.encode64 s1
s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

The good news here is that data read from a binary file is properly encoded and decoded from base64. The bad news is that a string that is encoded as UTF-8 will have that encoding lost, as shown above. So any characters outside of ASCII 7 will be improperly encoded. The solution in ruby-land it so enforce the encoding like so:

s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

# Force the encoding of the returned data
s2.force_encoding "UTF-8"
s2 #=> "¯\\_(ツ)_/¯"
s2.encoding #=> #<Encoding:UTF-8>

gcloud-node had a problem because it always forced the data to a UTF-8 string and lost data because of it. gcloud-ruby has a different problem where it loses the encoding of the string, requiring the users to force the encoding after the data is returned.

I think we have the following options:

  1. We can leave the behavior as-is. Data is not lost, only the encoding is lost. The users should know whether a string or binary data is on the message and can respond appropriately.
  2. We can automagically force the encoding of the returned message data if a specific named attribute is present. This is what the gcloud-node project has added with the "encoding" attribute. If the message does not have an "encoding" attribute then gcloud-ruby should behave as it does now and not force an encoding.
  3. We can match gcloud-node and always convert message data to/from a UTF-string, unless the users opt-out. (I like this option the least.)

@jgeewax @quartzmo Thoughts?

@quartzmo
Copy link
Member

quartzmo commented Dec 4, 2015

Is this issue about binary support (like gcloud-node #987, "Binary messages are unsupported by the js pubsub API"), or is it about preserving string encoding? Maybe we need two issues? Just kind of hard for me to follow.

@blowmage
Copy link
Contributor Author

blowmage commented Dec 4, 2015

The gcloud-node issue is saying that messages with binary data pulled by gcloud-node will corrupt or lose parts of the data, because of how gcloud-node was always forcing the UTF-8 encoding. gcloud-ruby does not have that problem.

@quartzmo
Copy link
Member

quartzmo commented Dec 4, 2015

Great! Now that I understand that there is no problem in gcloud-ruby with binary values, my preference for the string encoding issue is for either 1) with doc example showing the call to force_encoding "UTF-8", or 2) adding an option to do this force encoding in the library. I would only consider 3) if we receive external requests for it.

@blowmage
Copy link
Contributor Author

blowmage commented Dec 4, 2015

The issue as I see it is that we may be violating some users' expectations by not automatically converting message data to the proper encoding. The problem here is that there is no mechanism in Pub/Sub to specify how the message data is expected to be used. It could be a binary stream of bytes, or it could be a UTF-8 encoded string, or it could be a ISO-2022-JP encoded string. But since we know the encoding in ruby when publishing the message data, we do have the opportunity to automatically add that attribute to the message.

The counter-argument to that is all about being cautious about making too many assumptions. What if someone publishes a message with binary data and wants to use the "encoding" attribute for their own processing? Or the worker pulling the messages is not written in ruby, but is using another library with a different set of assumptions. One would think that those pulling the messages would know what type of data was published and how to deal with it. If we do 2 or 3 then we can cause problems in these scenarios. Something to keep in mind.

@quartzmo
Copy link
Member

quartzmo commented Dec 4, 2015

Very helpful analysis, thanks!

The problem here is that there is no mechanism in Pub/Sub to specify how the message data is expected to be used.

Since we're not currently doing anything harmful, and this may someday be addressed by Pub/Sub itself, maybe leave our behavior as is, and just add some doc discussion for now?

@blowmage
Copy link
Contributor Author

blowmage commented Dec 4, 2015

I have a very slight preference for 1). If we stay with the existing behavior then I agree we should address how to fix the encoding of a message's data in the documentation.

What I like about 2) is that we could delight users by removing all the encoding friction for them. I'm hoping someone pushes me over to this option...

@quartzmo
Copy link
Member

quartzmo commented Dec 4, 2015

I also vote for 1) with adding discussion in the docs.

@blowmage
Copy link
Contributor Author

blowmage commented Dec 4, 2015

But option 2) is so shiny and nice...

@quartzmo
Copy link
Member

quartzmo commented Dec 4, 2015

Not ruling it out! If anyone reading this in the future wishes we had implemented 2), please let us know.

@blowmage
Copy link
Contributor Author

blowmage commented May 5, 2016

Having now dealt with encoding issues for supporting Datastore's blob values, I am squarely in option 1's camp. Also, #605 pretty much cements this option.

Also also, I also think we could extend our support IO-like objects to Pub/Sub so that users could set a message's data to be a File. This would be handy for stuff like passing images in to get resized.

@mia-0032
Copy link

mia-0032 commented May 27, 2016

Option 1) or 2) seem good to me.

Our system has two encodings UTF-8 and Shift-JIS. Publishing Shift-JIS charactor I would like to pull Shift-JIS charactor. So option 3) is not good for me.

I use gcloud-ruby for publishing messages, but pulling messages I use another packages. If option 2) is common to Pubsub clients, I think option 2) is good.
In another aspect, if we change encodings of messages from the middle, option 2) seem easy to handle encoding in subscription applications.
Because we can add the same attribute from external, I think option 1) have no problem.

@blowmage
Copy link
Contributor Author

Option 2) would be specific to gcloud-ruby. AFAIK there is no standard among the various Pub/Sub clients to specify or atomically format the message data bytes.

@blowmage
Copy link
Contributor Author

This is not going to happen. Multibyte strings will be passed as their bytes, and returned as a ASCII-8BIT string. This is consistent with how Google Cloud APIs deal with binary data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue.
Projects
None yet
Development

No branches or pull requests

3 participants