str vs bytes decisions #12

sdgathman · 2016-11-01T20:19:53Z

It is fairly easy to chose between str and bytes in the C API, it is typically choosing s or y in the format strings for PyArg_Parse and Py_BuildValue. One exception is that there is no equivalent of z (allow None and convert to NULL) for y. This prevents simply making everything bytes, even though the C side does everything in 8-bits.

Obviously, body and replacebody are bytes. Pathname (from connect callback for unix socket) is recommended by C-API docs to be str, converted by a standard pathname converter (which handles unix vs windows, etc).

The header and addheader I made bytes, but then chgheader has to deal with passing None, so I can't simply use the y format, and making chgheader str while the other two are bytes would be inconsistent.

sdgathman · 2017-03-20T04:00:51Z

After several months to reflect, I think header, addheader should be str to be consistent with chgheader. There may be useful things to do with encoding in the future. (internationalizing, for instance)

william6502 · 2018-08-07T12:52:27Z

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

sdgathman · 2018-08-07T14:20:35Z

Is this with python3?

william6502 · 2018-08-07T17:20:25Z

yea..using Python 3.5.1

sdgathman#12 says "Obviously, body and replacebody are bytes" and milter_wrap_body in miltermodule.c says: arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen); … So pymilter should sport the correct documentation.

#12 says "Obviously, body and replacebody are bytes" and milter_wrap_body in miltermodule.c says: arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen); … So pymilter should sport the correct documentation.

apircalabu · 2019-06-27T06:40:33Z

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

Same here. How do I fix this? I'm using Python 3.6.8 on CentOS 7 and I get a similar error. Tested on both python36-pymilter-1.0.3-2.el7 rpm and on 1.0.4. Same code works with Python 2.7.5.

$ python
Python 2.7.5 (default, Jun 20 2019, 20:27:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)
>>> val.decode('latin-1')
u'\xc0\xb4\xd7\xd4'
>>> chardet.detect(val)
{'confidence': 0.7679697235616183, 'language': 'Russian', 'encoding': 'KOI8-R'}
>>> val.decode('KOI8-R')
u'\u044e\u2562\u0432\u0442'

$ python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
>>> val.decode('latin-1')
'À´×Ô'
>>> chardet.detect(val)
{'encoding': 'KOI8-R', 'confidence': 0.7679697235616183}
>>> val.decode('KOI8-R')
'ю╢вт'

william6502 · 2019-06-27T06:54:35Z

i fixed it in my pymilter fork and tested on a production server for a long time, so far so good.

apircalabu · 2019-06-27T07:08:44Z

Great, now testing, thanks.

sdgathman · 2019-06-27T19:42:42Z

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

william6502 · 2019-06-27T23:25:09Z

The testing mail is pretty simple, just change an existing mail subject line to contain some utf-8 characters, then it will crash the milter. I didn't create a pull request as my patch require change in the milter to work with new header return (bytes) which will break a lot of existing milter. I'm using my patched version on a logistic company email server which mails tend to be from different countries/mail clients and language encoding, so far it works pretty well.

apircalabu · 2019-06-27T23:53:48Z

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

Perhaps a unit test against an email that has this string as Subject value?
b'\xC0\xB4\xD7\xD4\x71\x71\x2E\x63\x6F\x6D\xB5\xC4\xCD\xCB\xD0\xC5'

scandox · 2019-07-25T14:52:48Z

I'm having a slightly different experience with the UTF-8 characters in Subject header. In the case of UTF-8 bytes being present in the header, the milter does not crash. However when non-UTF8 (and by extension non-ascii) bytes are present it does crash.

So for example the byte 163 "£" symbol (8859-1) in a subject line will cause the milter to crash. However the equivalent in UTF-8 (0xC2 0xA3) is causing no problem.

If I use @william6502 's fork and return bytes then the milter does not crash. As it happens because Postfix tolerates these invalid bytes in Headers, we have to be able to ignore/handle them. At present our Milter cannot deal with these emails.

We are running Python 3.5.3

sdgathman · 2019-07-25T21:22:50Z

I think what is needed is for the low level header callback to pass bytes. Then have another layer that can be hooked. I.e. the non-OO python callback is in Milter.init.py and should take bytes, converting via a supplied encoding before calling the OO header method. The encoding can be set to None for the OO header method to get bytes, and I can introduce an encoding exception method as well.

What do you think?

sdgathman · 2019-07-25T21:24:55Z

Encoding exceptions in a callback are somewhat awkward. Hey, I think the encoding should be supplied via a decorator, since that works well for a lot of the other callback options!

sdgathman · 2019-07-25T21:30:03Z

What about the other way? To make the low level addheader and chgheader take bytes, we need to null terminate the bytes in miltermodule, since the Python C API does not provide that function in python3. Or did I miss it?

Wait, isn't there an encoding that is essentially "hi bytes zero"? How do we make the C API use that? *** Goes to search docs...

scandox · 2019-07-25T22:13:02Z

Well I would very much defer to your judgement. I don't know half as much as I should about how libmilter or pymilter work - except that they're a huge boon to me!

Your first suggestion sounds good if I understood it. If I understand the situation correctly there's probably a lot of code out there with a header callback that expects a string (including probably in my infra). If I could add a decorator that caused it to receive bytes instead that would be fantastic from a useability / compatibility point of view.

sdgathman · 2019-08-10T15:31:26Z

In reviewing this long standing issue, I realized that chgheader is the only API making bytes for everything at the low level inconsistent. True, py3 offers no 'y' equivalent of 'z'. BUT, I can simply test for None, and alter the chgheader call accordingly. So I can make everything bytes. Then, we can patch up compatibility stuff in python.

andredalle · 2019-08-19T09:34:22Z

Bytes everywhere would be great. For what I do, and I suspect most common usage, I just want to get the whole email as bytes, and then use email.message_from_binary_file() in the eom() callback where everything interesting happens.

sdgathman · 2019-08-20T22:37:04Z

@scandox Added test case that gets this traceback in python3: a01f598

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 259, in _feedFile
    msg = mime.message_from_file(fp)
  File "/home/stuart/src/pymilter/mime.py", line 310, in message_from_file
    msg = message_from_binary_file(fp,MimeMessage)
  File "/usr/lib64/python3.7/email/__init__.py", line 62, in message_from_binary_file
    return BytesParser(*args, **kws).parse(fp)
  File "/usr/lib64/python3.7/email/parser.py", line 110, in parse
    return self.parser.parse(fp, headersonly)
  File "/usr/lib64/python3.7/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib64/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 156: invalid start byte

Does that cover the problems in production? Do I need another test case?

sdgathman · 2019-08-20T23:02:06Z

Nah, test case was not reading file as binary. Now it uncovers a bug in testctx.

sdgathman · 2019-08-27T23:51:12Z

Once again, the built-in email package utterly fails me. >:-( While it seems to be able to parse the malformed test file with email.message_from_binary_file(fp), and the invalid header results in a header object, which (reasonably) does not convert to a str, there doesn't seem to be any way to get the actual bytes of the header. I need that to be able to pass bytes to callbacks in the test frameworks - if I am going to change the low level API.

sdgathman · 2019-08-28T00:00:10Z

I may have to port mimetools from python2.7. :-(

sdgathman · 2019-08-28T00:23:51Z

The screwup happens early in email.parser.BytesParser.parse:

  fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')

sdgathman · 2019-08-28T01:54:54Z

surrogateescape saves the original bytes in a way that encode can recover. So, 4749f0f now produces this on the invalid header bytes test case:

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 287, in _feedFile
    rc = self._header(h,val)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 236, in _header
    return Milter.header_callback(self,fld,v)
  File "/home/stuart/src/pymilter/Milter/__init__.py", line 709, in header_callback
    return m.header(fld,s)
  File "/home/stuart/src/pymilter/sample.py", line 105, in header
    self.log('%s: %s' % (name,val))
  File "/home/stuart/src/pymilter/sample.py", line 27, in log
    for i in msg: print(i,end=None)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 9-12: surrogates not allowed

At this point, an application could recover the original bytes like I did in testctx.py:

b = s.encode(encoding='ascii',errors='surrogateescape')

But, with this test case added, I'll work on adding a decorator for header callback.

sdgathman · 2019-12-31T23:45:22Z

Instead of a decorator, I think it is cleaner to add a header_bytes() method to Milter.Base, which is invoked on callback:

  def header_bytes(self,fld,val):
     s = val.decode(encoding='ascii',errors='surrogateescape')
     return self.header(fld,s)

Applications could then override header_bytes() to deal directly with bytes, or header() to get a surrogate escaped unicode str. Any objections? Suggestions for better name?

sdgathman · 2020-01-01T00:07:37Z

Ok, that solution makes protocol_mask() too complicated, as it then has to deal with an alias for the header call back. How about we just pass the original bytes as a keyword arg which the app can ignore? No, that requires the encode even if not used. How about if apps just do:

class MyMilter(Milter.Base):
  def __init__():
    self.header_bytes = self.header
    ...

Is that clean enough to skip the surrogate encoding? Maybe a class decorator could do that?

sdgathman · 2020-01-01T01:12:52Z

Next idea: have default header_bytes pass str decoded as 'ascii'. If that fails, pass the bytes instead. That counts as an api change - is it too late to make that change?

JPT580 · 2020-01-30T16:27:16Z

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

scandox · 2020-01-30T16:37:21Z

@JPT580 I think @sdgathman is still cogitating on the best way to fix it. I think there's a lot of different considerations involved reviewing this thread. So not an easy decision.

We are using @william6502 's fork which works for our purposes but which might not suit all cases.

JPT580 · 2020-01-30T16:48:51Z

@scandox Thanks for your reply!
I will take a look into getting that fork running.

sdgathman · 2020-01-30T17:13:03Z

@JPT580

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

The current version on github has bytes available via surrogate encoding. It looks like pypi.org has 1.0.4. Let me check if that includes surrogate encoding.

Going forward, I want to have both header_bytes and header callbacks. The idea is for milter module to invoke header_bytes callback, which in turn invokes header callback with bytes decoded from ascii to str with surrogates. The tricky part is making sure the magic callback decorators still work.

You have always been able to hook the low level module callback to get bytes.

sdgathman · 2020-06-18T23:43:10Z

Here is the proposed API for header callback (similar for envfrom and envrcpt):
https://pymilter.org/pymilter/namespaceMilter.html#a35e41d682d3eb2121bc9644d82cae624

Please comment.

Still to be considered: header field names MUST be 7-bit ascii - but there will no doubt be emails with illegal bytes in the field name. Current behaviour is miltermodule.c gets an internal decoding exception, and returns 4xx error. Not ideal, since it looks like a bug in the milter.

I suppose I have to add the same mechanics for field name as field value.

Second issue: helo MUST be a FQDN. But half the time it isn't, and I wouldn't be surprised if it often has illegal bytes for utf-8. So helo probably needs the same treatment. Currently, illegal utf-8 bytes in helo trigger a 4xx response.

Third issue: going through header_bytes on the way to header callback could be a performance issue. You can get the same behaviour as @decode('bytes') by assigning:

mymilter.header_bytes = mymilter.header

I don't have good benchmarks to see how important this is. Same for envrcpt. The header_bytes method currently replaces itself with header the first time it is called when header has the @decode('bytes') decorator. This is kind of a hack.

mschiff · 2020-10-29T15:33:16Z

Hi @sdgathman, we are facing the same issues and currently we patch pymilter as suggested by @william6502

But that way we not only have to adapt our own custom milters, but also have to patch other python milters we use.

Is there any code already we might give a try? Or would you be willing to accept patches / PRs that follow your approach in the proposed API?

sdgathman · 2020-11-19T22:35:51Z

Is there any code already we might give a try? Or would you be willing to accept patches / PRs that follow your approach in the proposed API?

The code has been posted since June. Would if help it I make a "pre" release?

sdgathman · 2021-01-10T03:17:41Z

Did some debugging with new framework: e7592c6

sdgathman · 2021-03-16T19:26:52Z

On Tue, 16 Mar 2021, Attila Nagy wrote: Any news on this? With actual master, I get this for Subject: árvíztűrő tükörfúrógép in header(self, name, value): >>> print(repr(name), repr(value)) 'Subject' '\udce1rv\udcedzt\udcfbr\udcf5 t\udcfck\udcf6rf\udcfar\udcf3g\udce 9p' As a network server, I expect these to be bytes, because the use can specify anything (like here, latin2 characters).

With current API, you can either override header_bytes(), or use the @decode('bytes') decorator on header().

bra-fsn · 2021-03-16T20:19:19Z

On Tue, 16 Mar 2021, Attila Nagy wrote:

Any news on this? With actual master, I get this for Subject: árvíztűrő
tükörfúrógép in header(self, name, value):

print(repr(name), repr(value))
'Subject' '\udce1rv\udcedzt\udcfbr\udcf5 t\udcfck\udcf6rf\udcfar\udcf3g\udce
9p'

As a network server, I expect these to be bytes, because the use can specify
anything (like here, latin2 characters).

With current API, you can either override header_bytes(), or use
the @decode('bytes') decorator on header().

Yeah, sorry, I've read the discussion and deleted the comment . :)

sdgathman · 2021-03-16T21:42:01Z

On Tue, 16 Mar 2021, Attila Nagy wrote: With current API, you can either override header_bytes(), or use the @decode('bytes') decorator on header(). Yeah, sorry, I've read the discussion and deleted the comment . :)

No worries. The new API seems overly complex, which is why I'm still not happy with it. I welcome bug reports and criticism.

kitterma · 2021-11-17T01:57:03Z

I've recently gotten feedback from a German dkimpy-milter user that the current released API is causing problems due to, for example, ISO-85591 "Umlauts". I understand this isn't easy, but it would be really nice to see a release I can use to update dkipmy-milter to properly process non-UTF-8.

sdgathman · 2021-11-18T05:08:12Z

I've recently gotten feedback from a German dkimpy-milter user that the current released API is causing problems due to, for example, ISO-85591 "Umlauts". I understand this isn't easy, but it would be really nice to see a release I can use to update dkipmy-milter to properly process non-UTF-8.

Is that just a matter of using header_bytes (or decorator)?

sdgathman · 2021-11-18T05:13:59Z

I agree. Time for a release. The new API seems workable, and if there are any performance problem, just override header_bytes. Although there may be some interaction with the negotiate API.

sdgathman self-assigned this Aug 6, 2018

dkg mentioned this issue Feb 20, 2019

Correctly document that body callback chunks are in bytes #28

Merged

kitterma mentioned this issue Nov 17, 2021

In Python 3.10, PY_SSIZE_T_CLEAN must be defined before including Python.h #44

Closed

str vs bytes decisions #12

str vs bytes decisions #12

Comments

sdgathman commented Nov 1, 2016

sdgathman commented Mar 20, 2017

william6502 commented Aug 7, 2018

sdgathman commented Aug 7, 2018

william6502 commented Aug 7, 2018

apircalabu commented Jun 27, 2019 • edited Loading

william6502 commented Jun 27, 2019

apircalabu commented Jun 27, 2019

sdgathman commented Jun 27, 2019

william6502 commented Jun 27, 2019

apircalabu commented Jun 27, 2019

scandox commented Jul 25, 2019

sdgathman commented Jul 25, 2019 • edited Loading

sdgathman commented Jul 25, 2019

sdgathman commented Jul 25, 2019 • edited Loading

scandox commented Jul 25, 2019 • edited Loading

sdgathman commented Aug 10, 2019

andredalle commented Aug 19, 2019

sdgathman commented Aug 20, 2019 • edited Loading

sdgathman commented Aug 20, 2019

sdgathman commented Aug 27, 2019

sdgathman commented Aug 28, 2019

sdgathman commented Aug 28, 2019

sdgathman commented Aug 28, 2019 • edited Loading

sdgathman commented Dec 31, 2019

sdgathman commented Jan 1, 2020

sdgathman commented Jan 1, 2020

JPT580 commented Jan 30, 2020

scandox commented Jan 30, 2020

JPT580 commented Jan 30, 2020

sdgathman commented Jan 30, 2020 • edited Loading

sdgathman commented Jun 18, 2020 • edited Loading

mschiff commented Oct 29, 2020

sdgathman commented Nov 19, 2020 • edited Loading

sdgathman commented Jan 10, 2021

sdgathman commented Mar 16, 2021 via email

bra-fsn commented Mar 16, 2021

sdgathman commented Mar 16, 2021 via email

kitterma commented Nov 17, 2021

sdgathman commented Nov 18, 2021

sdgathman commented Nov 18, 2021

apircalabu commented Jun 27, 2019 •

edited

Loading

sdgathman commented Jul 25, 2019 •

edited

Loading

sdgathman commented Jul 25, 2019 •

edited

Loading

scandox commented Jul 25, 2019 •

edited

Loading

sdgathman commented Aug 20, 2019 •

edited

Loading

sdgathman commented Aug 28, 2019 •

edited

Loading

sdgathman commented Jan 30, 2020 •

edited

Loading

sdgathman commented Jun 18, 2020 •

edited

Loading

sdgathman commented Nov 19, 2020 •

edited

Loading