Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str vs bytes decisions #12

Open
sdgathman opened this issue Nov 1, 2016 · 40 comments
Open

str vs bytes decisions #12

sdgathman opened this issue Nov 1, 2016 · 40 comments
Assignees

Comments

@sdgathman
Copy link
Owner

It is fairly easy to chose between str and bytes in the C API, it is typically choosing s or y in the format strings for PyArg_Parse and Py_BuildValue. One exception is that there is no equivalent of z (allow None and convert to NULL) for y. This prevents simply making everything bytes, even though the C side does everything in 8-bits.

Obviously, body and replacebody are bytes. Pathname (from connect callback for unix socket) is recommended by C-API docs to be str, converted by a standard pathname converter (which handles unix vs windows, etc).

The header and addheader I made bytes, but then chgheader has to deal with passing None, so I can't simply use the y format, and making chgheader str while the other two are bytes would be inconsistent.

@sdgathman
Copy link
Owner Author

After several months to reflect, I think header, addheader should be str to be consistent with chgheader. There may be useful things to do with encoding in the future. (internationalizing, for instance)

@sdgathman sdgathman self-assigned this Aug 6, 2018
@william6502
Copy link

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

@sdgathman
Copy link
Owner Author

Is this with python3?

@william6502
Copy link

yea..using Python 3.5.1

dkg added a commit to dkg/pymilter that referenced this issue Feb 20, 2019
sdgathman#12 says "Obviously, body
and replacebody are bytes" and milter_wrap_body in miltermodule.c
says:

   arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen);
…

So pymilter should sport the correct documentation.
sdgathman pushed a commit that referenced this issue Feb 21, 2019
#12 says "Obviously, body
and replacebody are bytes" and milter_wrap_body in miltermodule.c
says:

   arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen);
…

So pymilter should sport the correct documentation.
sdgathman pushed a commit that referenced this issue Feb 22, 2019
#12 says "Obviously, body
and replacebody are bytes" and milter_wrap_body in miltermodule.c
says:

   arglist = Py_BuildValue("(Oy#)", c, bodyp, bodylen);
…

So pymilter should sport the correct documentation.
@apircalabu
Copy link

apircalabu commented Jun 27, 2019

i just happened to received some emails with non-ascii chars in the subject line without properly encoding, pymilter threw "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa6 in position 8: invalid start byte".. I don't know where to look at because there was no traceback..:-(

Same here. How do I fix this? I'm using Python 3.6.8 on CentOS 7 and I get a similar error. Tested on both python36-pymilter-1.0.3-2.el7 rpm and on 1.0.4. Same code works with Python 2.7.5.

$ python
Python 2.7.5 (default, Jun 20 2019, 20:27:34)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 0: ordinal not in range(128)
>>> val.decode('latin-1')
u'\xc0\xb4\xd7\xd4'
>>> chardet.detect(val)
{'confidence': 0.7679697235616183, 'language': 'Russian', 'encoding': 'KOI8-R'}
>>> val.decode('KOI8-R')
u'\u044e\u2562\u0432\u0442'

$ python3
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chardet
>>> val=b'\xc0\xb4\xd7\xd4'
>>> val.decode()
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
>>> val.decode('latin-1')
'À´×Ô'
>>> chardet.detect(val)
{'encoding': 'KOI8-R', 'confidence': 0.7679697235616183}
>>> val.decode('KOI8-R')
'ю╢вт'

@william6502
Copy link

i fixed it in my pymilter fork and tested on a production server for a long time, so far so good.

@apircalabu
Copy link

Great, now testing, thanks.

@sdgathman
Copy link
Owner Author

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

@william6502
Copy link

The testing mail is pretty simple, just change an existing mail subject line to contain some utf-8 characters, then it will crash the milter. I didn't create a pull request as my patch require change in the milter to work with new header return (bytes) which will break a lot of existing milter. I'm using my patched version on a logistic company email server which mails tend to be from different countries/mail clients and language encoding, so far it works pretty well.

@apircalabu
Copy link

More important than testing for a long time is a test case to add to unit testing. Did you save the email that made it fail? Maybe just the Subject header field to add to a test email..

Perhaps a unit test against an email that has this string as Subject value?
b'\xC0\xB4\xD7\xD4\x71\x71\x2E\x63\x6F\x6D\xB5\xC4\xCD\xCB\xD0\xC5'

@scandox
Copy link

scandox commented Jul 25, 2019

I'm having a slightly different experience with the UTF-8 characters in Subject header. In the case of UTF-8 bytes being present in the header, the milter does not crash. However when non-UTF8 (and by extension non-ascii) bytes are present it does crash.

So for example the byte 163 "£" symbol (8859-1) in a subject line will cause the milter to crash. However the equivalent in UTF-8 (0xC2 0xA3) is causing no problem.

If I use @william6502 's fork and return bytes then the milter does not crash. As it happens because Postfix tolerates these invalid bytes in Headers, we have to be able to ignore/handle them. At present our Milter cannot deal with these emails.

We are running Python 3.5.3

@sdgathman
Copy link
Owner Author

sdgathman commented Jul 25, 2019

I think what is needed is for the low level header callback to pass bytes. Then have another layer that can be hooked. I.e. the non-OO python callback is in Milter.init.py and should take bytes, converting via a supplied encoding before calling the OO header method. The encoding can be set to None for the OO header method to get bytes, and I can introduce an encoding exception method as well.

What do you think?

@sdgathman
Copy link
Owner Author

Encoding exceptions in a callback are somewhat awkward. Hey, I think the encoding should be supplied via a decorator, since that works well for a lot of the other callback options!

@sdgathman
Copy link
Owner Author

sdgathman commented Jul 25, 2019

What about the other way? To make the low level addheader and chgheader take bytes, we need to null terminate the bytes in miltermodule, since the Python C API does not provide that function in python3. Or did I miss it?

Wait, isn't there an encoding that is essentially "hi bytes zero"? How do we make the C API use that? *** Goes to search docs...

@scandox
Copy link

scandox commented Jul 25, 2019

Well I would very much defer to your judgement. I don't know half as much as I should about how libmilter or pymilter work - except that they're a huge boon to me!

Your first suggestion sounds good if I understood it. If I understand the situation correctly there's probably a lot of code out there with a header callback that expects a string (including probably in my infra). If I could add a decorator that caused it to receive bytes instead that would be fantastic from a useability / compatibility point of view.

@sdgathman
Copy link
Owner Author

In reviewing this long standing issue, I realized that chgheader is the only API making bytes for everything at the low level inconsistent. True, py3 offers no 'y' equivalent of 'z'. BUT, I can simply test for None, and alter the chgheader call accordingly. So I can make everything bytes. Then, we can patch up compatibility stuff in python.

@andredalle
Copy link

Bytes everywhere would be great. For what I do, and I suspect most common usage, I just want to get the whole email as bytes, and then use email.message_from_binary_file() in the eom() callback where everything interesting happens.

@sdgathman
Copy link
Owner Author

sdgathman commented Aug 20, 2019

@scandox Added test case that gets this traceback in python3: a01f598

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 259, in _feedFile
    msg = mime.message_from_file(fp)
  File "/home/stuart/src/pymilter/mime.py", line 310, in message_from_file
    msg = message_from_binary_file(fp,MimeMessage)
  File "/usr/lib64/python3.7/email/__init__.py", line 62, in message_from_binary_file
    return BytesParser(*args, **kws).parse(fp)
  File "/usr/lib64/python3.7/email/parser.py", line 110, in parse
    return self.parser.parse(fp, headersonly)
  File "/usr/lib64/python3.7/email/parser.py", line 54, in parse
    data = fp.read(8192)
  File "/usr/lib64/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 156: invalid start byte

Does that cover the problems in production? Do I need another test case?

@sdgathman
Copy link
Owner Author

Nah, test case was not reading file as binary. Now it uncovers a bug in testctx.

@sdgathman
Copy link
Owner Author

Once again, the built-in email package utterly fails me. >:-( While it seems to be able to parse the malformed test file with email.message_from_binary_file(fp), and the invalid header results in a header object, which (reasonably) does not convert to a str, there doesn't seem to be any way to get the actual bytes of the header. I need that to be able to pass bytes to callbacks in the test frameworks - if I am going to change the low level API.

@sdgathman
Copy link
Owner Author

I may have to port mimetools from python2.7. :-(

@sdgathman
Copy link
Owner Author

The screwup happens early in email.parser.BytesParser.parse:

  fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')

@sdgathman
Copy link
Owner Author

sdgathman commented Aug 28, 2019

surrogateescape saves the original bytes in a way that encode can recover. So, 4749f0f now produces this on the invalid header bytes test case:

ERROR: testHeader (testsample.BMSMilterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stuart/src/pymilter/testsample.py", line 33, in testHeader
    rc = ctx._feedFile(fp)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 287, in _feedFile
    rc = self._header(h,val)
  File "/home/stuart/src/pymilter/Milter/testctx.py", line 236, in _header
    return Milter.header_callback(self,fld,v)
  File "/home/stuart/src/pymilter/Milter/__init__.py", line 709, in header_callback
    return m.header(fld,s)
  File "/home/stuart/src/pymilter/sample.py", line 105, in header
    self.log('%s: %s' % (name,val))
  File "/home/stuart/src/pymilter/sample.py", line 27, in log
    for i in msg: print(i,end=None)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 9-12: surrogates not allowed

At this point, an application could recover the original bytes like I did in testctx.py:

b = s.encode(encoding='ascii',errors='surrogateescape')

But, with this test case added, I'll work on adding a decorator for header callback.

@sdgathman
Copy link
Owner Author

Instead of a decorator, I think it is cleaner to add a header_bytes() method to Milter.Base, which is invoked on callback:

  def header_bytes(self,fld,val):
     s = val.decode(encoding='ascii',errors='surrogateescape')
     return self.header(fld,s)

Applications could then override header_bytes() to deal directly with bytes, or header() to get a surrogate escaped unicode str. Any objections? Suggestions for better name?

@sdgathman
Copy link
Owner Author

Ok, that solution makes protocol_mask() too complicated, as it then has to deal with an alias for the header call back. How about we just pass the original bytes as a keyword arg which the app can ignore? No, that requires the encode even if not used. How about if apps just do:

class MyMilter(Milter.Base):
  def __init__():
    self.header_bytes = self.header
    ...

Is that clean enough to skip the surrogate encoding? Maybe a class decorator could do that?

@sdgathman
Copy link
Owner Author

Next idea: have default header_bytes pass str decoded as 'ascii'. If that fails, pass the bytes instead. That counts as an api change - is it too late to make that change?

@JPT580
Copy link

JPT580 commented Jan 30, 2020

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

@scandox
Copy link

scandox commented Jan 30, 2020

@JPT580 I think @sdgathman is still cogitating on the best way to fix it. I think there's a lot of different considerations involved reviewing this thread. So not an easy decision.

We are using @william6502 's fork which works for our purposes but which might not suit all cases.

@JPT580
Copy link

JPT580 commented Jan 30, 2020

@scandox Thanks for your reply!
I will take a look into getting that fork running.

@sdgathman
Copy link
Owner Author

sdgathman commented Jan 30, 2020

@JPT580

Just a quick question since i stumbled upon this issue, too: Is the fix ready to be pushed to pypi.org as a new version?

The current version on github has bytes available via surrogate encoding. It looks like pypi.org has 1.0.4. Let me check if that includes surrogate encoding.

Going forward, I want to have both header_bytes and header callbacks. The idea is for milter module to invoke header_bytes callback, which in turn invokes header callback with bytes decoded from ascii to str with surrogates. The tricky part is making sure the magic callback decorators still work.

You have always been able to hook the low level module callback to get bytes.

@sdgathman
Copy link
Owner Author

sdgathman commented Jun 18, 2020

Here is the proposed API for header callback (similar for envfrom and envrcpt):
https://pymilter.org/pymilter/namespaceMilter.html#a35e41d682d3eb2121bc9644d82cae624

Please comment.

Still to be considered: header field names MUST be 7-bit ascii - but there will no doubt be emails with illegal bytes in the field name. Current behaviour is miltermodule.c gets an internal decoding exception, and returns 4xx error. Not ideal, since it looks like a bug in the milter.

I suppose I have to add the same mechanics for field name as field value.

Second issue: helo MUST be a FQDN. But half the time it isn't, and I wouldn't be surprised if it often has illegal bytes for utf-8. So helo probably needs the same treatment. Currently, illegal utf-8 bytes in helo trigger a 4xx response.

Third issue: going through header_bytes on the way to header callback could be a performance issue. You can get the same behaviour as @decode('bytes') by assigning:

mymilter.header_bytes = mymilter.header

I don't have good benchmarks to see how important this is. Same for envrcpt. The header_bytes method currently replaces itself with header the first time it is called when header has the @decode('bytes') decorator. This is kind of a hack.

@mschiff
Copy link

mschiff commented Oct 29, 2020

Hi @sdgathman, we are facing the same issues and currently we patch pymilter as suggested by @william6502

But that way we not only have to adapt our own custom milters, but also have to patch other python milters we use.

Is there any code already we might give a try? Or would you be willing to accept patches / PRs that follow your approach in the proposed API?

@sdgathman
Copy link
Owner Author

sdgathman commented Nov 19, 2020

Is there any code already we might give a try? Or would you be willing to accept patches / PRs that follow your approach in the proposed API?

The code has been posted since June. Would if help it I make a "pre" release?

@sdgathman
Copy link
Owner Author

Did some debugging with new framework: e7592c6

@sdgathman
Copy link
Owner Author

sdgathman commented Mar 16, 2021 via email

@bra-fsn
Copy link

bra-fsn commented Mar 16, 2021

On Tue, 16 Mar 2021, Attila Nagy wrote:

Any news on this? With actual master, I get this for Subject: árvíztűrő
tükörfúrógép in header(self, name, value):

print(repr(name), repr(value))
'Subject' '\udce1rv\udcedzt\udcfbr\udcf5 t\udcfck\udcf6rf\udcfar\udcf3g\udce
9p'

As a network server, I expect these to be bytes, because the use can specify
anything (like here, latin2 characters).

With current API, you can either override header_bytes(), or use
the @decode('bytes') decorator on header().

Yeah, sorry, I've read the discussion and deleted the comment . :)

@sdgathman
Copy link
Owner Author

sdgathman commented Mar 16, 2021 via email

@kitterma
Copy link
Collaborator

I've recently gotten feedback from a German dkimpy-milter user that the current released API is causing problems due to, for example, ISO-85591 "Umlauts". I understand this isn't easy, but it would be really nice to see a release I can use to update dkipmy-milter to properly process non-UTF-8.

@sdgathman
Copy link
Owner Author

I've recently gotten feedback from a German dkimpy-milter user that the current released API is causing problems due to, for example, ISO-85591 "Umlauts". I understand this isn't easy, but it would be really nice to see a release I can use to update dkipmy-milter to properly process non-UTF-8.

Is that just a matter of using header_bytes (or decorator)?

@sdgathman
Copy link
Owner Author

I agree. Time for a release. The new API seems workable, and if there are any performance problem, just override header_bytes. Although there may be some interaction with the negotiate API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants