Update Main to Py3 #171

say-yawn · 2020-11-06T15:59:58Z

About this PR

Related to #108, this is the most chunk of work that was needed to update list creation script as Python 3 compatible. Some important changes were:

The tracking and the entity lists are in alphanumeric order now (Order domains alphanumerically in safebrowsing lists #166)
- The dummytracker domains are exempt from ordering and remain at the top of the lists
- In the case of entity lists, URLs are primarily sorted based on the entity name and there is a secondary alphanumeric ordering within each entity
The checksum is fixed where the reading of the header is delimited by new line rather than number of bytes/string (Port script to Python 3 #163) this line

Implement unit tests to validate Safe Browsing v2 blocklist generation. Include a test where the versioned test domain is not included in the blocklist and another test where both test domains are omitted, in order to ensure full branch coverage. Closes #137

Implement a test to check that write_safebrowsing_blocklist works as expected even when no output file is provided. Add a comment to explain when this happens.

Implement unit tests for process_entity_whitelist and process_plugin_blocklist. Include tests that check that these functions work as expected even when no log file is provided. Additionally, modify one of the canonicalize test cases so that it contains a valid utf-8 byte sequence.

Implement unit tests for get_entity_lists. Include a testcase where the version is smaller than the version in which large entities separation started, as well as testcases where large entities are separated.

Move the plugin list creation logic from main to a separate get_plugin_lists function, in order to be consistent with get_tracker_lists and get_entity_lists and to facilitate testing.

Raise a ValueError if the 'blocklist' key of a plugin configuration section is empty. Previously, this would lead to the creation of an empty plugin blocklist.

Add unit tests for blocklist and entity list generation

Add a unit test that validates the implementation of get_domains_from_category_filters and another one that confirms that it raises a ValueError when receiving invalid input. Also, add a sample blocklist JSON file to be used in the tests.

Add domains with tags to sample_blocklist.json in order to use them in tests. Implement a simple test just with category filters, a test with category exclusion filters and a test with the "performance" tag.

Implement unit tests for get_tracker_lists. Include testcases to cover the following configuration section scenarios: * It does not specify categories and the default are used * It specifies categories * It specifies excluded categories * It specifies tags * It contains an invalid tag and an error is raised * It contains a version key Finally, add one more domain to the Cryptomining category in tests/sample_blocklist.json to facilitate tag testing.

Add unit tests for domain filtering and get_tracker_lists

Implement a test that validates loading the JSON entity list from a URL. Add another test to cover the case where opening the URL fails with an exception. Additionally, make load_json_from_url include the type and message of this exception in the error message it prints to stderr.

Add more unit tests

…emoteSettings Handle requests.exceptions.ReadTimeout

According to changes on mozilla-services/shavar-prod-lists#284 we are now using "FingerprintInvasive" instead of "Fingerprint" category

Rename the Fingerprinting category to FingerprintingInvasive in the tests and in sample_blocklist.json. Also, update an example in the docstring of get_domains_from_filters with the new category name.

Return the checksum of the contents of the list file starting right after the chunk header. Previously, we were always starting from the 26th byte, regardless of the size of the header.

Move the code that uses the configuration file to determine if a list should be uploaded to S3 outside of `new_data_to_publish_to_s3` and into `publish_to_cloud`. Rename `check_upload_remote_settings_config` to `check_upload_config` and adjust it to work both for S3 and remote settings. In `publish_to_cloud` make sure that `new_data_to_publish_to_s3` is only called when `upload_to_s3` is True, to avoid the cost of unnecessarily accessing S3. In addition, when the key corresponding to a list does not yet exist on S3, return True right away instead of creating a key with dummy list contents and comparing against its checksum. Also, fix a minor bug in getting the name of the S3 key from the configuration file. Previously, if the section did not have the `s3_key` option, `new_data_to_publish_to_s3` would raise a `NoOptionError` instead of falling back to using the `output` option as the S3 key name. Raise a ValueError when the `s3_key` option exists but is empty. Finally, add a docstring to describe the functionality of `new_data_to_publish_to_s3`.

Add moto as a requirement and implement unit tests for new_data_to_publish_to_s3.

Add unit tests to check that the expected S3 key permissions are set in new_data_to_publish_to_s3 and publish_to_s3. These tests are currently expected to fail because setting key permissions with add_user_grant() does not work in moto.

…ategory-on-sample-config Update sample configuration with new fingerprinting category name

Sort the canonicalized domains alphanumerically before writing their hashes to the safebrowsing list file. This ensures that any changes in the order of domains in the input JSON lists will not cause unnecessary list updates. Additionally, this is a prerequisite for porting the list creation script to Python 3. Hash randomization is enabled by default starting from Python 3.3 and this would result in a differently ordered list being created every time the script is run. Also, replace the previous_domains set with a previous_domain string. Now that the domains are ordered, if there are duplicates they will be consecutive. Finally, adjust the tests to the changes and make sure that write_safebrowsing_blocklist tests check both the order of hashes and the existence of duplicates. Note: the dummytracker domains are exempt from ordering and remain at the top of the list file.

Sort the canonicalized domains alphanumerically before writing their hashes to the safebrowsing list file. Replace the previous domains set with a previous_domain string. Finally, adjust the tests to the changes and make sure that process_plugin_blocklist tests check both the order of hashes and the existence of duplicates.

Sort the canonicalized URLs of each entity alphanumerically before writing their hashes to the safebrowsing list file. URLs are still sorted based on the entity name and this commit introduces a secondary ordering within each entity. Use a set to store canonicalized URLs before sorting them, so that duplicates are eliminated. Finally, make sure that process_entitylist tests check both the order of hashes and the existence of duplicates.

Add unit tests for publish2cloud.py

Order domains alphanumerically in safebrowsing lists

Make sure to encode string objects to bytes and decode byte objects to strings wherever necessary, as a preparatory step for transitioning to Python 3. More specifically: * Encode URLs before hashing them, as hashlib.sha256() expects a bytes object in Python 3 * Convert chunk headers to bytes before writing them to a binary file * Decode data right after receiving it from urlopen * Decode the content of HTTP responses before printing it * Decode data right after reading it from a binary file * Encode the data used to mock urlopen in tests Also, fix a minor bug by using the 'b' prefix at the start of each line of a multi-line bytes literal instead of only using it for the first line. Finally, omit the "utf-8" argument when encoding, as UTF-8 is the default encoding.

Apply the following changes to make the list creation script compatible with Python 3. After these changes it will no longer support Python 2. * Import configparser instead of ConfigParser, as this module was renamed in Python 3 * Import quote and unquote from urllib.parse and urlopen from urllib.request instead of importing urllib2 * Remove mock from the dependencies as it is part of the standard library in Python 3 * Pin dependencies to the latest version * Use %d instead of %s when formatting a binary chunk header to avoid getting a TypeError. Also, use %d instead of %u, which is deprecated * Set the "version" key equal to the empty string instead of None when reverting configuration to avoid getting a "TypeError: option values must be strings". * Update CircleCI configuration to use a docker image with Python 3.8 * Update Python version requirement in README * Update README instructions to use virtualenv with Python 3 Closes #108.

Use config.read_file() to read configuration from a file instead of config.readfp(), which is deprecated. Also, use ConfigParser instead of SafeConfigParser, which is an alias that will be removed in future Python versions.

Port script to Python 3

Separate test requirements

Update the instructions in the README to use the new test requirements file. Also, update the dependency caching key in the configuration of CircleCI, so that it includes the checksum of the test requirements file as well.

Update README and CircleCI config to use requirements-test.txt

groovecoder

code & logic LGTM.

boolean5 and others added 30 commits July 29, 2020 12:35

Test write_safebrowsing_blocklist without file

c8de365

Implement a test to check that write_safebrowsing_blocklist works as expected even when no output file is provided. Add a comment to explain when this happens.

Add unit tests for get_entity_lists

635cbd7

Implement unit tests for get_entity_lists. Include a testcase where the version is smaller than the version in which large entities separation started, as well as testcases where large entities are separated.

Move plugin list creation into separate function

2e874c3

Move the plugin list creation logic from main to a separate get_plugin_lists function, in order to be consistent with get_tracker_lists and get_entity_lists and to facilitate testing.

Raise error on empty plugin blocklist URL

54aa5f0

Raise a ValueError if the 'blocklist' key of a plugin configuration section is empty. Previously, this would lead to the creation of an empty plugin blocklist.

Add unit tests for get_plugin_lists

d75a222

Merge pull request #142 from boolean5/test-write-safebrowsing-blocklist

9edf4e3

Add unit tests for blocklist and entity list generation

Add tests for get_domains_from_category_filters

4b90e4d

Add a unit test that validates the implementation of get_domains_from_category_filters and another one that confirms that it raises a ValueError when receiving invalid input. Also, add a sample blocklist JSON file to be used in the tests.

Add unit tests for get_domains_from_filters

39c2f70

Add domains with tags to sample_blocklist.json in order to use them in tests. Implement a simple test just with category filters, a test with category exclusion filters and a test with the "performance" tag.

Merge pull request #143 from boolean5/test-get-domains-from-filters

12431fb

Add unit tests for domain filtering and get_tracker_lists

Add unit tests for get_output_and_log_files

69d8a35

Add unit tests for get_list_url

a256be4

Merge pull request #154 from boolean5/various-tests

2dedaa5

Add more unit tests

Handle requests.exceptions.ReadTimeout

c1e49c8

Merge pull request #164 from mozilla-services/handle-ReadTimeout-on-R…

d188162

…emoteSettings Handle requests.exceptions.ReadTimeout

Rename fingerprinting category

6eac34b

According to changes on mozilla-services/shavar-prod-lists#284 we are now using "FingerprintInvasive" instead of "Fingerprint" category

Update tests with new fingerprinting category

1197867

Rename the Fingerprinting category to FingerprintingInvasive in the tests and in sample_blocklist.json. Also, update an example in the docstring of get_domains_from_filters with the new category name.

Fix bug in chunk_metadata

75ee68b

Return the checksum of the contents of the list file starting right after the chunk header. Previously, we were always starting from the 26th byte, regardless of the size of the header.

Add unit test for chunk_metadata

da3abf4

Fix flake8 error in publish2cloud.py

07ee9e8

Add unit tests for new_data_to_publish_to_s3

70952a0

Add moto as a requirement and implement unit tests for new_data_to_publish_to_s3.

Add unit tests for publish_to_s3

a5abbc9

Test S3 key permissions

028068b

Add unit tests to check that the expected S3 key permissions are set in new_data_to_publish_to_s3 and publish_to_s3. These tests are currently expected to fail because setting key permissions with add_user_grant() does not work in moto.

Merge pull request #165 from mozilla-services/rename-fingerprinting-c…

d9cb212

…ategory-on-sample-config Update sample configuration with new fingerprinting category name

boolean5 and others added 16 commits August 17, 2020 21:41

Fix minor typos in comments

b97a321

Nit-picky code style

27db63b

Merge pull request #162 from boolean5/add-tests-for-p2c

2f0ae8d

Add unit tests for publish2cloud.py

Merge pull request #166 from boolean5/order-domains

0c8e0d2

Order domains alphanumerically in safebrowsing lists

Remove setuptools dependency

e514094

Fix configparser deprecation warnings

a1a55f1

Use config.read_file() to read configuration from a file instead of config.readfp(), which is deprecated. Also, use ConfigParser instead of SafeConfigParser, which is an alias that will be removed in future Python versions.

Merge pull request #163 from boolean5/py3-upgrade

d7e0111

Port script to Python 3

Separate test requirements and update config.yml

59cc098

Merge pull request #168 from mozilla-services/separate-test-requirements

8bf1675

Separate test requirements

Use the new requirements file everywhere

89e9f94

Update the instructions in the README to use the new test requirements file. Also, update the dependency caching key in the configuration of CircleCI, so that it includes the checksum of the test requirements file as well.

Merge pull request #169 from boolean5/requirements-file-update

c702ebe

Update README and CircleCI config to use requirements-test.txt

Hush logs

1ed9841

Write logs

4affffc

say-yawn requested a review from groovecoder November 6, 2020 16:07

groovecoder approved these changes Nov 6, 2020

View reviewed changes

say-yawn merged commit c62abc4 into main Nov 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Main to Py3 #171

Update Main to Py3 #171

say-yawn commented Nov 6, 2020 •

edited

Loading

groovecoder left a comment

Update Main to Py3 #171

Update Main to Py3 #171

Conversation

say-yawn commented Nov 6, 2020 • edited Loading

About this PR

groovecoder left a comment

Choose a reason for hiding this comment

say-yawn commented Nov 6, 2020 •

edited

Loading