str vs bytes decisions #12
After several months to reflect, I think header, addheader should be str to be consistent with chgheader. There may be useful things to do with encoding in the future. (internationalizing, for instance) |
I just happened to receive some emails with non-ASCII characters in the subject line that were not properly encoded, and pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte". I don't know where to look because there was no traceback. :-( |
Is this with python3? |
yea..using Python 3.5.1 |
sdgathman#12 says "Obviously, body and replacebody are bytes" and milter_wrap_body in miltermodule.c says: arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen); … So pymilter should sport the correct documentation.
Same here. How do I fix this? I'm using Python 3.6.8 on CentOS 7 and I get a similar error. Tested on both python36-pymilter-1.0.3-2.el7 rpm and on 1.0.4. Same code works with Python 2.7.5. $ python $ python3 |
I fixed it in my pymilter fork and have tested it on a production server for a long time; so far so good. |
Great, now testing, thanks. |
More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email.. |
The test mail is pretty simple: just change an existing mail's subject line to contain some UTF-8 characters, and it will crash the milter. I didn't create a pull request because my patch requires changes in the milter to work with the new header return type (bytes), which will break a lot of existing milters. I'm using my patched version on a logistics company's email server, whose mail tends to come from different countries, mail clients, and language encodings; so far it works pretty well. |
Perhaps a unit test against an email that has this string as Subject value? |
I'm having a slightly different experience with UTF-8 characters in the Subject header. When UTF-8 bytes are present in the header, the milter does not crash. However, when non-UTF-8 (and by extension non-ASCII) bytes are present, it does crash. For example, byte 163 (0xA3), the "£" symbol in ISO-8859-1, in a subject line will cause the milter to crash, while the UTF-8 equivalent (0xC2 0xA3) causes no problem. If I use @william6502's fork and return bytes, the milter does not crash. Because Postfix tolerates these invalid bytes in headers, we have to be able to ignore or handle them. At present our milter cannot deal with these emails. We are running Python 3.5.3. |
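For anyone wanting to reproduce the difference outside a milter, here is a minimal sketch in plain Python. The subject bytes and the helper name try_decode are made up for illustration; nothing here calls pymilter.

```python
# Byte 0xA3 is the pound sign in ISO-8859-1, but on its own it is not valid UTF-8.
latin1_subject = b"Price: \xa3 10"
# The same character encoded as UTF-8 takes two bytes.
utf8_subject = b"Price: \xc2\xa3 10"


def try_decode(raw: bytes) -> str:
    """Return the decoded text, or a short note describing the failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return "decode failed: " + exc.reason


utf8_result = try_decode(utf8_subject)
latin1_result = try_decode(latin1_subject)

print(utf8_result)
print(latin1_result)
```

The second call fails in the same way as the error reported earlier in the thread: a strict UTF-8 decode is applied to bytes that were never UTF-8.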
I think what is needed is for the low level header callback to pass bytes. Then have another layer that can be hooked. I.e. the non-OO python callback is in Milter.__init__.py and should take bytes, converting via a supplied encoding before calling the OO header method. The encoding can be set to None for the OO header method to get bytes, and I can introduce an encoding exception method as well. What do you think? |
Encoding exceptions in a callback are somewhat awkward. Hey, I think the encoding should be supplied via a decorator, since that works well for a lot of the other callback options! |
What about the other way? To make the low level addheader and chgheader take bytes, we need to null terminate the bytes in miltermodule, since the Python C API does not provide that function in python3. Or did I miss it? Wait, isn't there an encoding that is essentially "hi bytes zero"? How do we make the C API use that? *** Goes to search docs... |
Well I would very much defer to your judgement. I don't know half as much as I should about how libmilter or pymilter work - except that they're a huge boon to me! Your first suggestion sounds good if I understood it. If I understand the situation correctly there's probably a lot of code out there with a header callback that expects a string (including probably in my infra). If I could add a decorator that caused it to receive bytes instead that would be fantastic from a usability / compatibility point of view. |
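To make the decorator idea concrete, here is a rough sketch of how an opt-in could work. The names decode_as and dispatch_header are hypothetical and are not part of pymilter; the sketch only assumes that the low level layer has the raw bytes and consults a marker on the callback before deciding what to pass.

```python
def decode_as(strategy):
    """Hypothetical decorator: record how a header callback wants its value."""
    def mark(func):
        func.decode_strategy = strategy
        return func
    return mark


def dispatch_header(callback, name, raw_value):
    """Hypothetical low level hook: pass bytes or a decoded str as requested."""
    strategy = getattr(callback, "decode_strategy", "surrogateescape")
    if strategy == "bytes":
        return callback(name, raw_value)
    # Any other strategy is used as the error handler for an ASCII decode.
    return callback(name, raw_value.decode("ascii", strategy))


@decode_as("bytes")
def wants_bytes(name, value):
    return value


def wants_str(name, value):
    # Undecorated callbacks keep receiving str, so existing code still works.
    return value


bytes_result = dispatch_header(wants_bytes, "Subject", b"\xa3")
str_result = dispatch_header(wants_str, "Subject", b"\xa3")
```

Existing callbacks are untouched in this arrangement, and only the ones that ask for bytes get them.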
In reviewing this long standing issue, I realized that chgheader is the only API that makes using bytes for everything at the low level inconsistent. True, py3 offers no 'y' equivalent of 'z'. BUT, I can simply test for None, and alter the chgheader call accordingly. So I can make everything bytes. Then, we can patch up compatibility stuff in python. |
Bytes everywhere would be great. For what I do, and I suspect most common usage, I just want to get the whole email as bytes, and then use email.message_from_binary_file() in the eom() callback where everything interesting happens. |
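A sketch of that usage pattern, assuming all callbacks deliver bytes. MessageCollector is a hypothetical stand-in and does not derive from any pymilter class; the callback names simply mirror the ones discussed in this thread.

```python
import email
import io


class MessageCollector:
    """Hypothetical stand-in for a milter object; not the pymilter API."""

    def __init__(self):
        self.fp = io.BytesIO()

    def header(self, name: bytes, value: bytes):
        self.fp.write(name + b": " + value + b"\r\n")

    def eoh(self):
        # A blank line separates the headers from the body.
        self.fp.write(b"\r\n")

    def body(self, chunk: bytes):
        self.fp.write(chunk)

    def eom(self):
        # Rewind and let the email package parse the collected bytes.
        self.fp.seek(0)
        return email.message_from_binary_file(self.fp)


collector = MessageCollector()
collector.header(b"Subject", b"Price: \xa3 10")  # invalid as UTF-8
collector.header(b"From", b"a@example.com")
collector.eoh()
collector.body(b"hello\r\n")
msg = collector.eom()

print(msg["From"])
```

Nothing is decoded until the whole message is available, so a malformed header cannot raise in the middle of a callback.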
@scandox Added test case that gets this traceback in python3: a01f598
Does that cover the problems in production? Do I need another test case? |
Nah, test case was not reading file as binary. Now it uncovers a bug in testctx. |
Once again, the built-in email package utterly fails me. >:-( While it seems to be able to parse the malformed test file with email.message_from_binary_file(fp), and the invalid header results in a header object, which (reasonably) does not convert to a str, there doesn't seem to be any way to get the actual bytes of the header. I need that to be able to pass bytes to callbacks in the test frameworks - if I am going to change the low level API. |
I may have to port mimetools from python2.7. :-( |
The screwup happens early in email.parser.BytesParser.parse:
|
surrogateescape saves the original bytes in a way that encode can recover. So, 4749f0f now produces this on the invalid header bytes test case:
At this point, an application could recover the original bytes like I did in testctx.py:
But, with this test case added, I'll work on adding a decorator for header callback. |
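For readers unfamiliar with surrogateescape, a minimal round trip in plain Python looks like this. It is an illustration of the error handler only, not the code from testctx.py, and the sample bytes are made up.

```python
# Latin-1 bytes that are invalid as UTF-8 and as ASCII.
raw_value = b"caf\xe9 \xa3 5"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
as_str = raw_value.decode("ascii", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes.
recovered = as_str.encode("ascii", errors="surrogateescape")

# A strict encode refuses the lone surrogates, which is why such a str
# cannot simply be printed or written out as UTF-8.
try:
    as_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(repr(as_str), recovered == raw_value, strict_ok)
```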
Instead of a decorator, I think it is cleaner to add a header_bytes() method to Milter.Base, which is invoked on callback:
Applications could then override header_bytes() to deal directly with bytes, or header() to get a surrogate escaped unicode str. Any objections? Suggestions for better name? |
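A rough sketch of the override idea, to make the two options concrete. BaseSketch is a hypothetical stand-in for Milter.Base, CONTINUE is a placeholder return value, and passing the field name as str is an assumption made only to keep the example short.

```python
CONTINUE = "continue"  # placeholder for the real milter return code


class BaseSketch:
    """Hypothetical stand-in for Milter.Base; only shows the dispatch."""

    def header_bytes(self, name: str, value: bytes):
        # Default behaviour: decode with surrogateescape, then call header().
        return self.header(name, value.decode("ascii", "surrogateescape"))

    def header(self, name: str, value: str):
        return CONTINUE


class StrMilter(BaseSketch):
    def header(self, name, value):
        # Receives a str that may contain lone surrogates.
        self.seen = (name, value)
        return CONTINUE


class BytesMilter(BaseSketch):
    def header_bytes(self, name, value):
        # Receives the untouched bytes; header() is never called.
        self.seen = (name, value)
        return CONTINUE


str_milter = StrMilter()
str_milter.header_bytes("Subject", b"\xa3")
bytes_milter = BytesMilter()
bytes_milter.header_bytes("Subject", b"\xa3")
```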
Ok, that solution makes protocol_mask() too complicated, as it then has to deal with an alias for the header call back. How about we just pass the original bytes as a keyword arg which the app can ignore? No, that requires the encode even if not used. How about if apps just do:
Is that clean enough to skip the surrogate encoding? Maybe a class decorator could do that? |
Next idea: have default header_bytes pass str decoded as 'ascii'. If that fails, pass the bytes instead. That counts as an api change - is it too late to make that change? |
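The "ascii first, bytes on failure" idea could be sketched as below. The function name decode_or_bytes is hypothetical. The sketch also shows the cost of the approach: the callback has to cope with two possible argument types.

```python
def decode_or_bytes(value: bytes):
    """Return a str for pure ASCII input, otherwise the original bytes."""
    try:
        return value.decode("ascii")
    except UnicodeDecodeError:
        return value


clean = decode_or_bytes(b"Hello")
dirty = decode_or_bytes(b"Price: \xa3 10")

print(type(clean).__name__, type(dirty).__name__)
```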
Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version? |
@JPT580 I think @sdgathman is still cogitating on the best way to fix it. I think there's a lot of different considerations involved reviewing this thread. So not an easy decision. We are using @william6502 's fork which works for our purposes but which might not suit all cases. |
@scandox Thanks for your reply! |
The current version on github has bytes available via surrogate encoding. It looks like pypi.org has 1.0.4. Let me check if that includes surrogate encoding. Going forward, I want to have both header_bytes and header callbacks. The idea is for milter module to invoke header_bytes callback, which in turn invokes header callback with bytes decoded from ascii to str with surrogates. The tricky part is making sure the magic callback decorators still work. You have always been able to hook the low level module callback to get bytes. |
Here is the proposed API for header callback (similar for envfrom and envrcpt): Please comment. Still to be considered: header field names MUST be 7-bit ascii - but there will no doubt be emails with illegal bytes in the field name. Current behaviour is miltermodule.c gets an internal decoding exception, and returns 4xx error. Not ideal, since it looks like a bug in the milter. I suppose I have to add the same mechanics for field name as field value. Second issue: helo MUST be a FQDN. But half the time it isn't, and I wouldn't be surprised if it often has illegal bytes for utf-8. So helo probably needs the same treatment. Currently, illegal utf-8 bytes in helo trigger a 4xx response. Third issue: going through header_bytes on the way to header callback could be a performance issue. You can get the same behaviour as @decode('bytes') by assigning:
I don't have good benchmarks to see how important this is. Same for envrcpt. The header_bytes method currently replaces itself with header the first time it is called when header has the |
Hi @sdgathman, we are facing the same issues and currently we patch pymilter as suggested by @william6502 But that way we not only have to adapt our own custom milters, but also have to patch other python milters we use. Is there any code already we might give a try? Or would you be willing to accept patches / PRs that follow your approach in the proposed API? |
The code has been posted since June. Would it help if I make a "pre" release? |
Did some debugging with new framework: e7592c6 |
On Tue, 16 Mar 2021, Attila Nagy wrote:
> Any news on this? With current master, I get this for Subject: árvíztűrő tükörfúrógép in header(self, name, value):
> >>> print(repr(name), repr(value))
> 'Subject' '\udce1rv\udcedzt\udcfbr\udcf5 t\udcfck\udcf6rf\udcfar\udcf3g\udce9p'
> As a network server, I expect these to be bytes, because the user can specify anything (like here, latin2 characters).
With current API, you can either override header_bytes(), or use the @decode('bytes') decorator on header().
|
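For a header() callback that still receives the surrogate escaped str, the value quoted above can be turned back into bytes and then decoded with the real charset. This is a sketch in plain Python; that the charset is ISO-8859-2 is known here only because the sender said so.

```python
# The str value as reported in the comment above.
value = "\udce1rv\udcedzt\udcfbr\udcf5 t\udcfck\udcf6rf\udcfar\udcf3g\udce9p"

# Undo the surrogate escaping to get the bytes that were on the wire.
raw = value.encode("ascii", "surrogateescape")

# Decode with the charset the sender actually used (latin2).
text = raw.decode("iso-8859-2")

print(raw)
print(text)
```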
Yeah, sorry, I've read the discussion and deleted the comment. :) |
On Tue, 16 Mar 2021, Attila Nagy wrote:
> > With current API, you can either override header_bytes(), or use the @decode('bytes') decorator on header().
> Yeah, sorry, I've read the discussion and deleted the comment. :)
No worries. The new API seems overly complex, which is why I'm still not happy with it. I welcome bug reports and criticism.
|
I've recently gotten feedback from a German dkimpy-milter user that the current released API is causing problems due to, for example, ISO-8859-1 "Umlauts". I understand this isn't easy, but it would be really nice to see a release I can use to update dkimpy-milter to properly process non-UTF-8. |
Is that just a matter of using header_bytes (or decorator)? |
I agree. Time for a release. The new API seems workable, and if there are any performance problems, just override header_bytes. Although there may be some interaction with the negotiate API. |
It is fairly easy to choose between str and bytes in the C API: it is typically a matter of choosing s or y in the format strings for PyArg_Parse and Py_BuildValue. One exception is that there is no equivalent of z (allow None and convert to NULL) for y. This prevents simply making everything bytes, even though the C side does everything in 8-bits.
Obviously, body and replacebody are bytes. Pathname (from connect callback for unix socket) is recommended by C-API docs to be str, converted by a standard pathname converter (which handles unix vs windows, etc).
The header and addheader I made bytes, but then chgheader has to deal with passing None, so I can't simply use the y format, and making chgheader str while the other two are bytes would be inconsistent.