Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make some improvements in canonicalize #139

Open
5 tasks
boolean5 opened this issue Jun 4, 2020 · 0 comments
Open
5 tasks

Make some improvements in canonicalize #139

boolean5 opened this issue Jun 4, 2020 · 0 comments

Comments

@boolean5
Copy link
Contributor

boolean5 commented Jun 4, 2020

The discussion that took place while writing the unit tests for canonicalize highlighted several improvements that can be made. Specifically:

  • Use urlparse to extract the hostname and the path from the url instead of using a regular expression. This is expected to simplify the code significantly.

  • Add a docstring that will mention the differences between our implementation and the one suggested in the Safe Browsing v2 docs.

  • Add a couple of extra test cases:

    • A test case that checks that escape sequences for the tab, CR and LF characters (i.e., %09, %0d and %0a) are not removed (this is one of the canonicalization rules mentioned in the Safe Browsing v2 docs but it's not covered by the suggested test cases).

        ("retain_escaped_tab_cr_lf", "http://www.google.com/foo%09bar%0dbaz%0a2", "www.google.com/foo%09bar%0Dbaz%0A2"),
      
    • A test case that checks that a url is canonicalized correctly when it contains a
      username and password.

  • Remove a few TODO comments that don't seem to be important any more (the ones that suggest using d, _subs_made = re.subn(...) instead of d = re.subn(...)[0]: 1, 2, 3).

  • I would also suggest renaming the d parameter of canonicalize to domain or url. Avoiding single-letter variable names makes for more readable code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants