Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing of author information #521

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

yggi49
Copy link

@yggi49 yggi49 commented Nov 12, 2022

Instead of relying on regular expressions, this patch leverages Python’s builtin email.utils.parseaddr() functionality to parse an RFC-822-compliant email address string into its name and address parts.

This should also resolve issues with special characters in the name part; see for example:

This is a followup to python-poetry/poetry PR #1040, as advised by @Secrus’ comment.

If this PR gets accepted, there will be two additional steps:

  1. Refactor poetry.console.commands.init.InitCommand._validate_author() to use the new parse_author() helper, see https://github.com/python-poetry/poetry/blob/master/src/poetry/console/commands/init.py#L441
  2. Remove AUTHOR_REGEX from poetry.core.packages.package.
  • Added tests for changed code.
  • Updated documentation for changed code.

Instead of relying on regular expressions, this patch leverages Python’s
builtin `email.utils.parseaddr()` functionality to parse an RFC-822-compliant
email address string into its name and address parts.

This should also resolve issues with special characters in the name
part; see for example Poetry issues python-poetry#370 and python-poetry#798.

python-poetry/poetry#370
python-poetry/poetry#798
@@ -118,3 +119,60 @@ def test_utils_helpers_readme_content_type(
readme: str | Path, content_type: str
) -> None:
assert readme_content_type(readme) == content_type


def test_utils_helpers_parse_author():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make it into parametrized test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done; see commit 8286618.

@@ -32,6 +33,8 @@

T = TypeVar("T", bound="Package")

# TODO: once poetry.console.commands.init.InitCommand._validate_author
# uses poetry.core.utils.helpers.parse_author, this can be removed.
AUTHOR_REGEX = re.compile(r"(?u)^(?P<name>[- .,\w\d'’\"():&]+)(?: <(?P<email>.+?)>)?$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you stack this upon #517, please?

Copy link
Member

@finswimmer finswimmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about email.utils before. If we can make it fly, it is clearly preferred over an own regex implementation 👍

I played around a bit with it. For any reasons with blabla <[email protected] I receive blabla as name and [email protected] as email. Which is surprising, because the closing > is missing. 🤔

Comment on lines 126 to 131
if "@" not in address:
return address, None
name, email = parseaddr(address)
if not name and "@" not in email:
return email, None
return name or None, email or None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can remove all the if blocks.

@yggi49
Copy link
Author

yggi49 commented Nov 14, 2022

It seems it will be hard (if not impossible) to make this backward compatible, given the existing tests for valid and invalid author names. test_package_authors_valid() creates a few combinations that contradict RFC-822; on the other hand, test_package_author_names_invalid() has a couple of cases that would actually be valid names.

What’s the general concept in Poetry for author and maintainer strings? Judging from the test suite, it appears that a name is mandatory, whereas an email address is optional. Then again, the ValueError raised in poetry.core.packages.package.Package._get_author() explicitly says that author strings “Must be in the format: John Smith <[email protected]>, indicating that an email address must be provided as well. (The same also goes for poetry.core.packages.package.Package._get_maintainer().)

For reference: According to the PyPA specification (derived from PEP-621), author and maintainer information can be either name-only, email-only, or name and email:

  1. If only name is provided, the value goes in Author or Maintainer as appropriate.

  2. If only email is provided, the value goes in Author-email or Maintainer-email as appropriate.

  3. If both email and name are provided, the value goes in Author-email or Maintainer-email as appropriate, with the format {name} <{email}>.

  4. Multiple values should be separated by commas.

Furthermore, RFC-822-compliance for names is mentioned as well:

The name value MUST be a valid email name (i.e. whatever can be put as a name, before an email, in RFC 822) and not contain commas. The email value MUST be a valid email address.

TL;DR: I think it needs to be decided whether to adopt the PyPA spec (at least in this regard, which would be a breaking change), or if Poetry wants to continue having its own concept of what constitutes valid author/maintainer strings. In the latter case, this PR can probably be closed without further ado.

@finswimmer
Copy link
Member

Hello @yggi49,

thanks for these details 👍 . IMO the goal must be PyPA spec compliance. But maybe in two steps.

Within the current scope of your PR we should switch from the regex implementation to email.utils, but keep the author name mandatory and the email optional. This results in less strict rules for the names - which is backwards compatible. Tests must be adopted accordingly.

In a second PR we can than make name optional as well (unless other maintainers have a veto) and also make sure that name/email goes into the correct metadata fields as described in https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#authors-maintainers.

fin swimmer

@yggi49
Copy link
Author

yggi49 commented Nov 23, 2022

Thank you for providing guidance, @finswimmer. I am going to tweak the logic accordingly (name mandatory, email address optional), and adjust the tests as needed.

@neersighted
Copy link
Member

I'm not a fan of how the tests are structured in this PR; I would like something much more thorough and incorporated to the existing tests (see the tests/additions in #517).

@yggi49
Copy link
Author

yggi49 commented Nov 24, 2022

@finswimmer, I adapted the logic towards PyPA-compliance, with a mandatory name and an optional email address. Check out the test cases for the new parse_author() helper on what is supported and what is disallowed; I hope this matches your expectations. The non-RFC-conformant cases were added due to them being discussed in python-poetry/poetry PR #1040.

@neersighted, my primary goal was to factor out author parsing into a single separate helper method and provide a comprehensive list of test cases for it, to showcase the expected behavior. Initially, I didn’t want to mess around too much with what was there already, but merely adapted the cases of existing tests accordingly. Personally, I think that the tests in test_package.pytest_package_authors_invalid(), test_package_authors_valid(), and test_package_author_names_invalid()—might no longer be needed and could be removed, as everything should already be covered by the tests for parse_author().

@finswimmer
Copy link
Member

finswimmer commented Nov 24, 2022

Thanks for making progress on that @yggi49 👍

I'm still a bit irritated by the behavior of email.utils.parseaddr() in some cases, e.g.:

email.utils.parseaddr("Me [Some Department] <[email protected]>")
('', 'Me')

I would expect ('Me [Some Department] ', '[email protected]').

email.utils.parseaddr("Me <[email protected]")
('Me', '[email protected]')

I would expected either a parsing error (preferred) or ('Me <[email protected], '').

I don't understand why it behaves like this. Expected? Bug? How do we want to handle this?

We could check if we can reconstruct the address that goes into parseaddr, with the returned value and throw an error if not. Bu that would mean, that the example above is really invalid. Where are the specs for this? 🤔

fin swimmer

@yggi49
Copy link
Author

yggi49 commented Nov 24, 2022

If brackets should be part of the name, the name must be quoted—i.e., it should read "Me [Some Department]" <[email protected]>. The same goes for white space and other special characters: ()<>@,;:\"..

Python’s address parsing is implemented in email._parseaddr.AddrlistClass. I didn’t really dive into it, and neither did I do a deep dive into the RFC, so I cannot tell what the unquoted Me [Some Department] <[email protected]> actually means. (Email address strings can get quite complicated, covering a couple of arcane use cases, and I don’t think it’s worth reinventing the wheel here.)

I did implement a poor-man’s reverse check into the parse_author() helper, though, and calling parse_author("Me [Some Department] <[email protected]>") will raise a ValueError.

@finswimmer
Copy link
Member

It's a rabbit hole ... 🐇

I asked at the PyPA discord server about the behavior of email.utils.parseaddr(). And as you've already find out, quoting (double quotes only!) is the key. They came up with another solution for parsing name and email that do also a validation:

>>> from email.headerregistry import AddressHeader
>>> parsed = AddressHeader.value_parser("Me [Some Department] <[email protected]>")
>>> len(parsed.all_defects) > 0
True

>>> parsed = AddressHeader.value_parser('"Me [Some Department]" <[email protected]>')
>>> parsed.all_defects
[]
>>> parsed.mailboxes[0].addr_spec
'[email protected]'
>>> parsed.mailboxes[0].display_name
'Me [Some Department]'

So if we detect an @ in the address we might go this way to parse out name and email. But ...

  • I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.
  • Require to quote the name in pyproject.toml is ugly. Maybe it should be the responsible of the backend to quote the name if needed when writing to Author-email metadata. But if the name is not quoted in the pyproject.toml, we need some kind of pre-processing to add some, before we can parse out name and email as described above.

This all leads to the question: Is it all worth it? I don't know and does not have a strong opinion about it. Maybe some of the other maintainers?

@yggi49
Copy link
Author

yggi49 commented Nov 25, 2022

I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.

@finswimmer, I already mentioned that this is going to be the case, since the existing regular expression for the name part (?P<name>[- .,\w\d'’\"():&]+) allows several special characters—specifically (),.:", plus space—that the RFC requires to be quoted.

Maybe it’s easier to approach this entire topic from the other side, and make a basic strategic decision for this project first: is PyPA compatiblity the goal, or does Poetry want to keep doing it its own way?

  • In case of the former, this is inevitably going to introduce a breaking change. Plus, you won’t have to worry whether quoted names in pyproject.toml are ugly or not, since that decision has already been made by someone else, as quoted above.

    I don’t think, though, that the breaking change would hurt too much, as the majority of cases probably still could go unquoted. After all, only names that contain special characters require quoting.

  • If Poetry wants to pursue its own rules, it should then be specified what is supported and what isn’t by defining the test cases for the parse_author() helper, and adapt the implementation accordingly.

@sonarcloud
Copy link

sonarcloud bot commented Dec 5, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@ghost
Copy link

ghost commented Feb 8, 2023

@finswimmer, any updates on this front?

@Secrus Secrus added the 2.0 PRs for core 2.0 label Oct 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0 PRs for core 2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants