Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Main to Py3 #171

Merged
merged 46 commits into from
Nov 6, 2020
Merged

Update Main to Py3 #171

merged 46 commits into from
Nov 6, 2020

Conversation

say-yawn
Copy link
Contributor

@say-yawn say-yawn commented Nov 6, 2020

About this PR

Related to #108, this is the most chunk of work that was needed to update list creation script as Python 3 compatible. Some important changes were:

  • The tracking and the entity lists are in alphanumeric order now (Order domains alphanumerically in safebrowsing lists #166)
    • The dummytracker domains are exempt from ordering and remain at the top of the lists
    • In the case of entity lists, URLs are primarily sorted based on the entity name and there is a secondary alphanumeric ordering within each entity
  • The checksum is fixed where the reading of the header is delimited by new line rather than number of bytes/string (Port script to Python 3 #163) this line

boolean5 and others added 30 commits July 29, 2020 12:35
Implement unit tests to validate Safe Browsing v2 blocklist generation.
Include a test where the versioned test domain is not included in the
blocklist and another test where both test domains are omitted, in order
to ensure full branch coverage. Closes #137
Implement a test to check that write_safebrowsing_blocklist works as
expected even when no output file is provided. Add a comment to explain
when this happens.
Implement unit tests for process_entity_whitelist and
process_plugin_blocklist. Include tests that check that these functions
work as expected even when no log file is provided. Additionally, modify
one of the canonicalize test cases so that it contains a valid utf-8
byte sequence.
Implement unit tests for get_entity_lists. Include a testcase where the
version is smaller than the version in which large entities separation
started, as well as testcases where large entities are separated.
Move the plugin list creation logic from main to a separate
get_plugin_lists function, in order to be consistent with
get_tracker_lists and get_entity_lists and to facilitate testing.
Raise a ValueError if the 'blocklist' key of a plugin configuration
section is empty. Previously, this would lead to the creation of an
empty plugin blocklist.
Add unit tests for blocklist and entity list generation
Add a unit test that validates the implementation of
get_domains_from_category_filters and another one that confirms that it
raises a ValueError when receiving invalid input. Also, add a sample
blocklist JSON file to be used in the tests.
Add domains with tags to sample_blocklist.json in order to use them in
tests. Implement a simple test just with category filters, a test with
category exclusion filters and a test with the "performance" tag.
Implement unit tests for get_tracker_lists. Include testcases to cover
the following configuration section scenarios:

  * It does not specify categories and the default are used
  * It specifies categories
  * It specifies excluded categories
  * It specifies tags
  * It contains an invalid tag and an error is raised
  * It contains a version key

Finally, add one more domain to the Cryptomining category in
tests/sample_blocklist.json to facilitate tag testing.
Add unit tests for domain filtering and get_tracker_lists
Implement a test that validates loading the JSON entity list from a URL.
Add another test to cover the case where opening the URL fails with an
exception. Additionally, make load_json_from_url include the type and
message of this exception in the error message it prints to stderr.
…emoteSettings

Handle requests.exceptions.ReadTimeout
According to changes on mozilla-services/shavar-prod-lists#284 we are now using "FingerprintInvasive" instead of "Fingerprint" category
Rename the Fingerprinting category to FingerprintingInvasive in the
tests and in sample_blocklist.json. Also, update an example in the
docstring of get_domains_from_filters with the new category name.
Return the checksum of the contents of the list file starting right
after the chunk header. Previously, we were always starting from the
26th byte, regardless of the size of the header.
Move the code that uses the configuration file to determine if a list
should be uploaded to S3 outside of `new_data_to_publish_to_s3` and into
`publish_to_cloud`. Rename `check_upload_remote_settings_config` to
`check_upload_config` and adjust it to work both for S3 and remote
settings. In `publish_to_cloud` make sure that
`new_data_to_publish_to_s3` is only called when `upload_to_s3` is True,
to avoid the cost of unnecessarily accessing S3.

In addition, when the key corresponding to a list does not yet exist on
S3, return True right away instead of creating a key with dummy list
contents and comparing against its checksum.

Also, fix a minor bug in getting the name of the S3 key from the
configuration file. Previously, if the section did not have the `s3_key`
option, `new_data_to_publish_to_s3` would raise a `NoOptionError`
instead of falling back to using the `output` option as the S3 key name.
Raise a ValueError when the `s3_key` option exists but is empty.

Finally, add a docstring to describe the functionality of
`new_data_to_publish_to_s3`.
Add moto as a requirement and implement unit tests for
new_data_to_publish_to_s3.
Add unit tests to check that the expected S3 key permissions are set in
new_data_to_publish_to_s3 and publish_to_s3. These tests are currently
expected to fail because setting key permissions with add_user_grant()
does not work in moto.
…ategory-on-sample-config

Update sample configuration with new fingerprinting category name
Sort the canonicalized domains alphanumerically before writing their
hashes to the safebrowsing list file. This ensures that any changes in
the order of domains in the input JSON lists will not cause unnecessary
list updates. Additionally, this is a prerequisite for porting the list
creation script to Python 3. Hash randomization is enabled by default
starting from Python 3.3 and this would result in a differently ordered
list being created every time the script is run.

Also, replace the previous_domains set with a previous_domain string.
Now that the domains are ordered, if there are duplicates they will be
consecutive. Finally, adjust the tests to the changes and make sure that
write_safebrowsing_blocklist tests check both the order of hashes and
the existence of duplicates.

Note: the dummytracker domains are exempt from ordering and remain at
the top of the list file.
Sort the canonicalized domains alphanumerically before writing their
hashes to the safebrowsing list file. Replace the previous domains set
with a previous_domain string. Finally, adjust the tests to the changes
and make sure that process_plugin_blocklist tests check both the order
of hashes and the existence of duplicates.
boolean5 and others added 16 commits August 17, 2020 21:41
Sort the canonicalized URLs of each entity alphanumerically before
writing their hashes to the safebrowsing list file. URLs are still
sorted based on the entity name and this commit introduces a secondary
ordering within each entity. Use a set to store canonicalized URLs
before sorting them, so that duplicates are eliminated. Finally, make
sure that process_entitylist tests check both the order of hashes and
the existence of duplicates.
Add unit tests for publish2cloud.py
Order domains alphanumerically in safebrowsing lists
Make sure to encode string objects to bytes and decode byte objects to
strings wherever necessary, as a preparatory step for transitioning to
Python 3. More specifically:

  * Encode URLs before hashing them, as hashlib.sha256() expects a bytes
    object in Python 3
  * Convert chunk headers to bytes before writing them to a binary file
  * Decode data right after receiving it from urlopen
  * Decode the content of HTTP responses before printing it
  * Decode data right after reading it from a binary file
  * Encode the data used to mock urlopen in tests

Also, fix a minor bug by using the 'b' prefix at the start of each line
of a multi-line bytes literal instead of only using it for the first
line. Finally, omit the "utf-8" argument when encoding, as UTF-8 is the
default encoding.
Apply the following changes to make the list creation script compatible
with Python 3. After these changes it will no longer support Python 2.

  * Import configparser instead of ConfigParser, as this module was
    renamed in Python 3
  * Import quote and unquote from urllib.parse and urlopen from
    urllib.request instead of importing urllib2
  * Remove mock from the dependencies as it is part of the standard
    library in Python 3
  * Pin dependencies to the latest version
  * Use %d instead of %s when formatting a binary chunk header to avoid
    getting a TypeError. Also, use %d instead of %u, which is deprecated
  * Set the "version" key equal to the empty string instead of None when
    reverting configuration to avoid getting a "TypeError: option values
    must be strings".
  * Update CircleCI configuration to use a docker image with Python 3.8
  * Update Python version requirement in README
  * Update README instructions to use virtualenv with Python 3

Closes #108.
Use config.read_file() to read configuration from a file instead of
config.readfp(), which is deprecated. Also, use ConfigParser instead of
SafeConfigParser, which is an alias that will be removed in future
Python versions.
Update the instructions in the README to use the new test requirements
file. Also, update the dependency caching key in the configuration of
CircleCI, so that it includes the checksum of the test requirements file
as well.
Update README and CircleCI config to use requirements-test.txt
Copy link
Contributor

@groovecoder groovecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code & logic LGTM.

@say-yawn say-yawn merged commit c62abc4 into main Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants