Should decode not encode UTF-8 messages? #277
Comments
We run `value = value.encode('utf-8')` merely to ensure that the data is valid Unicode. The motivation behind this is that in the past we have received corrupted data from Message Hub; that message is passed as part of the payload to the request, which itself attempts to encode the incoming data.

The ultimate point of running `value.encode('utf-8')` is just to check the incoming data. The data therefore is arriving corrupt and is not being corrupted by the use of this function call.
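For reference, here is a minimal Python 2.7 sketch (hypothetical, not the provider's actual code) of what `encode` versus `decode` does to a UTF-8 byte string like the one Kafka delivers:

```python
# -*- coding: utf-8 -*-
# Minimal Python 2.7 sketch (hypothetical, not the provider's code) showing why
# .encode('utf-8') on a byte string is not a safe "is this valid Unicode?" check.

raw = 'caf\xc3\xa9'            # UTF-8 bytes for u"café", as delivered by Kafka

# decode: bytes -> unicode, the direction the issue title suggests.
text = raw.decode('utf-8')     # u'caf\xe9'

# encode on a str first *implicitly decodes* it with the default ASCII codec,
# so any non-ASCII byte blows up instead of validating the UTF-8 payload.
try:
    raw.encode('utf-8')
except UnicodeDecodeError as err:
    print('str.encode() is really ascii-decode + utf-8-encode: %s' % err)
```

Either way, `decode` is the operation that actually turns Kafka's raw bytes into text.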
Geez, that is super confusing ( ;-) ). All I could tell from the docs (and my understanding of Kafka) is that the messages are always bytes, which need to be properly converted to text (assuming it is text, of course). I think that looks like a destructive test, though, since it assigns the result back to `value`.

There is some discussion in the #whisk-users channel about some data corruption. It was confirmed that the data is fine through Kafka (the consumer shows the right data), and obviously the action can be passed proper data and it looks fine. So this looked suspicious, since it is the "middleman" here. The discussion is here: https://ibm-ics.slack.com/archives/C0BUS3JE8/p1534022249000007
I agree with Scott that there is no issue with the IBM Message Hub instance (Kafka) WRT corruption of messages. To confirm that was indeed the case, I created a standalone producer and a standalone consumer.

The producer posts a message with some English and some non-English characters to a topic called Greeter. The consumer polls the same topic and prints out the message. As you can see from the output of each, the message is being received with the correct encoding from the IBM Message Hub topic. However, if I post the same message using the producer class and consume it via the IBM Cloud Function, the message is received corrupted, as described in the Slack conversation that Scott pointed to in his earlier comment.
Also note that I am not specifying any encoding in either of my implementations. I am relying on the default of UTF-8 as the encoding on both the producer and the consumer side, and assuming that the underlying transport is bytes.
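For concreteness, a minimal sketch of that kind of standalone round-trip test might look like the following. This is hypothetical (not the exact code used above): it assumes kafka-python, the broker address and message text are placeholders, the Message Hub SASL/SSL settings are omitted, and the UTF-8 default is written out explicitly.

```python
# Hypothetical standalone round-trip test (not the exact code used above).
# Message Hub requires SASL_SSL connection settings that are omitted here.
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = 'kafka-broker.example.com:9093'   # placeholder broker address
TOPIC = 'Greeter'
MESSAGE = u'Hello, Grüße, नमस्ते'              # English plus non-English characters

# Producer: Kafka transports raw bytes, so the text is sent UTF-8 encoded.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send(TOPIC, value=MESSAGE.encode('utf-8'))
producer.flush()

# Consumer: poll the same topic and decode the raw bytes back into text.
consumer = KafkaConsumer(TOPIC,
                         bootstrap_servers=BOOTSTRAP,
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=10000)
for record in consumer:
    print(record.value.decode('utf-8'))        # prints the original text intact
```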
And we know OpenWhisk Actions can receive parameters without corrupting the text. Should be pretty simple to validate.
Yes, I have already verified and eliminated that: when the OpenWhisk function/trigger is called directly (using curl) with the same payload, there is no corruption of data. It's only when the data is posted to the Message Hub topic and received by the whisk function (via this handler) that we see corruption of data.
Confirming that there is no issue with either the function or the user-provided IBM Message Hub trigger (calling the function) when invoked directly. Here is the payload being sent each time:

diamond:kafka mmehra$ cat data.json
diamond:kafka mmehra$ curl -u [XXXX:YYYYY] -d @data.json --header "Content-Type: application/json" -X POST https://openwhisk.ng.bluemix.net/api/v1/namespaces/XXXXX_dev/actions/greeting_package/say_hello?blocking=true
Attached below is a screenshot of the curl call made to the user-provided trigger, and a screenshot of the console log generated by the IBM Cloud Function, showing what parameters it received in both cases.
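For context, the direct-invocation test above exercises an action of roughly this shape (a hypothetical stand-in; the real `say_hello` implementation is not shown in this thread):

```python
# Hypothetical stand-in for greeting_package/say_hello: log and echo the
# parameters, which is enough to show whether non-ASCII text survives intact.
def main(params):
    print(params)                    # shows up in the activation/console log
    return {'received': params}
```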
@ScottChapman Yes, I agree. It is very confusing, especially the way Python 2.7 handles strings and bytes.

@maneeshmehra Thanks for providing that info. I'll do some testing on the provider end with your input, using encode vs. decode. My understanding of these built-in Python functions is quite limited. I was going off of my understanding of the behavior, but as I said, my understanding is quite limited and the way Python handles things is rather confusing. A little testing should be able to clear things up, though.

Thank you both for digging into this and providing feedback.
You are most welcome. Please keep us posted via this issue on your findings.
Any reason not to run as Python 3? It could normalize the UTF-8 string handling.
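For illustration, under Python 3 the bytes/str split makes the question unambiguous:

```python
# Python 3: bytes and str are distinct types, and the implicit ASCII decode
# that Python 2 performs inside str.encode() no longer exists.
raw = b'caf\xc3\xa9'        # raw UTF-8 bytes off the wire

text = raw.decode('utf-8')  # 'café' -- the only conversion that makes sense
# raw.encode('utf-8')       # AttributeError: 'bytes' object has no attribute 'encode'
```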
The fix for this encoding issue was deployed into our production environment on Sep 4th through a support ticket.
Discovered that messages containing unicode text appear to be getting corrupted.
I suspect it is this:
https://github.com/apache/incubator-openwhisk-package-kafka/blob/449bbae13e813ba4dcd11dc33f47ab29d5e3541a/provider/consumer.py#L455
From the kafka-python docs, the consumed message value is raw bytes that should be decoded by the application.
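A decode-based check along the lines the issue title suggests might look like this (a sketch only, assuming UTF-8 payloads; not necessarily the patch that was actually deployed):

```python
def normalize_value(raw):
    """Return the Kafka message value as text, assuming UTF-8 content.

    Sketch only: the provider's real handling of genuinely non-UTF-8
    payloads (e.g. binary messages) is not shown in this thread.
    """
    if isinstance(raw, bytes):        # in Python 2, bytes is an alias for str
        return raw.decode('utf-8')
    return raw
```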