Investigate Pub/Sub Encoding #453
I've run several experiments, and I can isolate the issue down to how Base64 encoding works on strings and data in Ruby. This means I can illustrate the problem without using gcloud-ruby, in the following example code. When a string or binary data is run through Base64, all string encoding information is lost: a UTF-8 string comes back as an ASCII-8BIT (binary) string.

require "base64"

s1 = '¯\_(ツ)_/¯'
s1 #=> "¯\\_(ツ)_/¯"
s1.encoding #=> #<Encoding:UTF-8>

s2 = Base64.decode64 Base64.encode64 s1
s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

The good news is that data read from a binary file is properly encoded to and decoded from Base64. The bad news is that a string encoded as UTF-8 will have that encoding lost, as shown above, so any characters outside of 7-bit ASCII will come back improperly tagged. The solution in Ruby-land is to force the encoding, like so:

s2 #=> "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF"
s2.encoding #=> #<Encoding:ASCII-8BIT>

# Force the encoding of the returned data
s2.force_encoding "UTF-8"
s2 #=> "¯\\_(ツ)_/¯"
s2.encoding #=> #<Encoding:UTF-8>

gcloud-node had a problem because it always forced the data to a UTF-8 string, and lost data because of it. gcloud-ruby has a different problem: it loses the encoding of the string, requiring users to force the encoding after the data is returned. I think we have the following options:
Is this issue about binary support (like gcloud-node #987, "Binary messages are unsupported by the js pubsub API"), or is it about preserving string encoding? Maybe we need two issues? It's just kind of hard for me to follow.
The gcloud-node issue is saying that messages with binary data pulled by gcloud-node will corrupt or lose parts of the data, because gcloud-node was always forcing the UTF-8 encoding. gcloud-ruby does not have that problem.
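To illustrate that failure mode in Ruby terms (this is not gcloud-node's actual code, just a standard-library sketch of what treating arbitrary bytes as UTF-8 does): once invalid sequences are cleaned up, the original bytes cannot be recovered.

require "base64"

# Arbitrary binary payload, e.g. the first bytes of a JPEG file.
binary = "\xFF\xD8\xFF\xE0\x00\x10JFIF".b
binary.valid_encoding? #=> true (ASCII-8BIT treats any byte sequence as valid)

# Pretend the bytes are UTF-8, as gcloud-node effectively did:
utf8 = binary.dup.force_encoding("UTF-8")
utf8.valid_encoding? #=> false

# Cleaning up the invalid sequences silently replaces bytes -- data is lost.
scrubbed = utf8.scrub
scrubbed.bytes == binary.bytes #=> false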
Great! Now that I understand that there is no problem in gcloud-ruby with binary values, my preference for the string encoding issue is 1), with a doc example showing the call to force_encoding.
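A minimal sketch (standard library only) of what such a doc example could say, with `data` standing in for the string handed back from a pulled Pub/Sub message; the valid_encoding? guard is an extra suggestion, not existing gcloud-ruby behavior.

# `data` stands in for the String returned from a pulled message;
# per this thread, gcloud-ruby returns it tagged as ASCII-8BIT (binary).
data = "\xC2\xAF\\_(\xE3\x83\x84)_/\xC2\xAF".b

# If the publisher sent UTF-8 text, restore the encoding tag before using it:
text = data.force_encoding "UTF-8"
raise "message data is not valid UTF-8" unless text.valid_encoding?

text          #=> "¯\\_(ツ)_/¯"
text.encoding #=> #<Encoding:UTF-8>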
The issue as I see it is that we may be violating some users' expectations by not automatically converting message data to the proper encoding. The problem is that Pub/Sub has no mechanism to specify how the message data is expected to be used. It could be a binary stream of bytes, a UTF-8 encoded string, or an ISO-2022-JP encoded string. But since we know the encoding in Ruby when publishing the message data, we do have the opportunity to automatically add that attribute to the message.

The counter-argument is about being cautious and not making too many assumptions. What if someone publishes a message with binary data and wants to use the "encoding" attribute for their own processing? Or the worker pulling the messages is not written in Ruby and uses another library with a different set of assumptions? One would think that those pulling the messages know what type of data was published and how to deal with it. If we do 2) or 3), we could cause problems in these scenarios. Something to keep in mind.
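For what option 2) could look like mechanically, here is a rough sketch with a plain Hash standing in for the Pub/Sub message; the "encoding" attribute name and both helpers are made up for illustration, not proposed API.

# Publishing side: stamp the string's encoding into a message attribute.
def build_message(data)
  {
    "data"       => data.dup.force_encoding("ASCII-8BIT"), # raw bytes on the wire
    "attributes" => { "encoding" => data.encoding.name }   # e.g. "UTF-8"
  }
end

# Pulling side: restore the encoding if the attribute is present.
def decode_message(message)
  data = message["data"]
  name = message.dig("attributes", "encoding")
  name ? data.force_encoding(name) : data
end

message = build_message '¯\_(ツ)_/¯'
decode_message(message).encoding #=> #<Encoding:UTF-8>

The counter-arguments in the comment above still apply: the attribute name becomes part of the message contract, and non-Ruby subscribers would have to honor it.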
Very helpful analysis, thanks!
Since we're not currently doing anything harmful, and this may someday be addressed by Pub/Sub itself, maybe leave our behavior as is, and just add some doc discussion for now?
I have a very slight preference for 1). If we stay with the existing behavior then I agree we should address how to fix the encoding of a message's data in the documentation. What I like about 2) is that we could delight users by removing all the encoding friction for them. I'm hoping someone pushes me over to this option...
I also vote for 1), with discussion added in the docs.
But option 2) is so shiny and nice...
Not ruling it out! If anyone reading this in the future wishes we had implemented 2), please let us know.
Having now dealt with encoding issues while supporting Datastore's blob values, I am squarely in option 1's camp. Also, #605 pretty much cements this option. Also also, I think we could extend our support for IO-like objects to Pub/Sub, so that users could set a message's data to be a File. This would be handy for stuff like passing images in to get resized.
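On the IO-like idea, a rough sketch of what the publishing side could do; message_data_from is a hypothetical helper, and the commented-out publish call is only a placeholder, not real gcloud-ruby API.

# Hypothetical helper: accept a String or anything IO-like as message data.
def message_data_from(source)
  if source.respond_to?(:read)
    data = source.read
    data.force_encoding("ASCII-8BIT") # treat IO contents as raw bytes
  else
    String(source)
  end
end

# Usage: publish the bytes of an image file.
File.open("photo.jpg", "rb") do |file|
  data = message_data_from(file)
  data.encoding #=> #<Encoding:ASCII-8BIT>
  # topic.publish data   # placeholder for the actual publish call
end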
Option 1) or 2) seems good to me. Our system involves two encodings. I use gcloud-ruby for publishing messages, but for pulling messages I use another package. If option 2) were common to Pub/Sub clients, I think option 2) would be good.
Option 2) would be specific to gcloud-ruby. AFAIK there is no standard among the various Pub/Sub clients for specifying or automatically formatting the message data bytes.
This is not going to happen. Multibyte strings will be passed as their bytes and returned as an ASCII-8BIT string. This is consistent with how Google Cloud APIs deal with binary data.
Does gcloud-ruby need an encoding option for encoding to/from base64? Or is the string's built-in encoding sufficient?
See googleapis/google-cloud-node#988