Python 2/3 compatibility for finance tasks #747
base: master
Conversation
# When pip-compile is run under python 2, it omits all packages with a python 3
# condition. Re-add them here, pre-pinned.
ujson==1.35 ; python_version > "2.7"
The correct thing to do, instead of using extra.txt, is to compile two completely separate sets of requirements .txt files, one for python 2 and one for python 3. However, I will consider that out of scope for this PR, and this works okay for now, if a little fragile.
edx-platform has a similar file called constraints.txt. Is this that different?
That is a different model. constraints.txt is fed into pip-compile, whereas our extra.txt is usually installed after default.txt by makefiles and other code in the production workflow. It's actually coincidental and lucky that we already have extra.txt, which makes it convenient for me to force-install some packages.
@@ -1,2 +1,2 @@
pip==9.0.1
pip==19.1.1
Required for package install conditionals (e.g. python_version <= "2.7").
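As an aside, the environment markers that newer pip understands can also be evaluated programmatically; here is a minimal illustration using the packaging library (not something this PR uses, just for context):

# Minimal illustration (not part of this PR): evaluating an environment marker
# like the one in the requirements file against the running interpreter.
from packaging.markers import Marker

marker = Marker('python_version > "2.7"')
print(marker.evaluate())  # True under python 3, False under python 2.7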
@@ -3,7 +3,7 @@
# Workaround for https://github.com/ansible/ansible/issues/8875
--no-binary ansible

ansible==1.4.5 # GPL v3 License
ansible<2.9.0 # GPL v3 License
This is required for invoking ansible with python 3.
Here's the current python 3 luigi execution summary:
Most of the changes look benign. I'm mostly concerned about getting different results due to changes in the json parser -- or at least understanding what differences are introduced. Perhaps we don't have to run all production jobs against this branch, but I think that it would be good to check something run in python2 that uses events. And running finance-reports on python3, of course.
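One cheap way to get a handle on those differences (a sketch only; the file name and choice of parsers are illustrative) would be to diff the two parsers over a sample of event lines:

# Sketch: compare stdlib json and ujson over a sample of event lines and
# report any line where the decoded objects differ. File name is illustrative.
import json
import ujson

with open('sample_events.log') as events:
    for line in events:
        if json.loads(line) != ujson.loads(line):
            print('parsers disagree on:', line)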
if ujson:
    return ujson.loads(obj)
elif cjson:
    return cjson.decode(obj, all_unicode=True)
Not all original calls specified all_unicode=True, especially the event loading. Would adding this option here have an impact on how we read events in Python2?
all_unicode supposedly only affects output, not parsing. Also, even if all_unicode=False, the output may contain unicode objects anyway wherever there were unicode characters in the input JSON. The pipeline code necessarily already needed to be able to handle unicode strings in the decoded python object, so forcing the whole thing to be unicode should be low-risk.
https://github.com/AGProjects/python-cjson/blob/master/cjson.c#L1166-L1171
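A small illustration of that claim (Python 2 with python-cjson; behavior inferred from the linked source, so treat it as an assumption):

# With cjson, all_unicode only controls whether pure-ASCII strings come back
# as str or unicode; strings with non-ASCII content are unicode either way.
import cjson  # python-cjson, Python 2 only

print(type(cjson.decode('{"a": "b"}')['a']))                    # str
print(type(cjson.decode('{"a": "b"}', all_unicode=True)['a']))  # unicode
print(type(cjson.decode('{"a": "caf\\u00e9"}')['a']))           # unicode either way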
@@ -677,12 +678,20 @@ def format_transaction_table_output(self, audit_code, transaction, orderitem, tr
orderitem.partner_short_code if orderitem else self.default_partner_short_code,
orderitem.payment_ref_id if orderitem else transaction.payment_ref_id,
orderitem.order_id if orderitem else None,
encode_id(orderitem.order_processor, "order_id", orderitem.order_id) if orderitem else None,
encode_id(
It might be good to confirm that the encode_id results match what is currently produced in tables. We should probably run the finance-report in dev and spot-check the actual output.
@@ -18,9 +18,10 @@ html5lib==1.0b3 # MIT
isoweek==1.3.3 # BSD
numpy==1.11.3 # BSD
paypalrestsdk==1.9.0 # Paypal SDK License
psycopg2==2.6.2 # LGPL
psycopg2 # LGPL
Was this also required for python3? Not sure what this is actually used for -- Vertica at the very least?
Oh, this was just so that I can run pip-compile on my local Linux machine; it would probably work as-is with a Mac python interpreter. I never really figured out why that is the case. Regardless, .in files generally shouldn't have such strict constraints, right?
idna==2.6 # via requests, snowflake-connector-python
ijson==2.3 # via snowflake-connector-python
ijson==2.4 # via snowflake-connector-python
What is ijson? If we're now pulling this in, can we use it instead of cjson? Or is it just not fast enough?
Provide an abstraction layer for fast json implementations across python 2 and 3.
"""
try:
    import ujson
Curious also about the choice of ujson. I came across an article (https://pythonspeed.com/articles/faster-json-library/) that noted in 4/2019 that "ujson has a number of bugs filed regarding crashes, and even those crashes that have been fixed aren’t always available because there hasn’t been a release since 2016." That particular article considered instead rapidjson and orjson as candidates, with a use case of encoding relatively short messages. Our application is the opposite, to decode relatively short messages (avg length 1K), so we might have a different result.
Hmm. Looking at https://pypi.org/project/orjson/, it says that while its encoding speeds are faster than the others', its decoding speed is roughly the same as to 2x that of the standard library. And it doesn't mention Python 2 support (though why would anyone anymore?). Rapidjson is apparently another C library with Python bindings, and its performance is documented here: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html. It suggests that for our deserialization task, all the packages are probably comparable, except in the case of significant Unicode strings, where simplejson and stdlib json both slow way down. So those results suggest that ujson is a fine enough choice.
I wasn't originally intending to implement something like FastJson, but the following two things compelled me to write a basic shell of an implementation:
- Luigi, via stevedore plugins, kept importing all the python files in the repo, even though none of those tasks are needed for financial reporting.
- cjson was uninstallable in a python 3 virtualenv, and possibly also vice versa (ujson + python 2).
We don't need to evaluate the json part of this PR any further if it does not block the proper functioning of the financial reporting tasks.
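For reference, here is a minimal sketch of the kind of abstraction layer described above (module layout and names are illustrative, not necessarily identical to what this PR adds):

"""
Provide an abstraction layer for fast json implementations across python 2 and 3.
"""
import json  # stdlib fallback, always available

try:
    import ujson  # fast C parser, installable under python 3
except ImportError:
    ujson = None

try:
    import cjson  # python 2 only; not installable in a python 3 virtualenv
except ImportError:
    cjson = None


def loads(obj):
    """Decode a JSON string with the fastest parser that imported successfully."""
    if ujson:
        return ujson.loads(obj)
    elif cjson:
        return cjson.decode(obj, all_unicode=True)
    return json.loads(obj)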
Change default for now to python 3.6, for testing. Also cast INT to STRING before UNION in Hive code. Pass 'future' package to all remote nodes. Fix creation of Vertica marker tables in Python 3. Convert OrderTransactionRecord from bytes.
@@ -56,20 +56,48 @@ upgrade: ## update the requirements/*.txt files with the latest packages satisfy
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/docs.txt requirements/docs.in
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/test.txt requirements/test.in

test-docker-local:
docker run --rm -u root -v `(pwd)`:/edx/app/analytics_pipeline/analytics_pipeline -it edxops/analytics_pipeline:latest make develop-local test-local
FYI, I deleted this make target because I couldn't figure out what it's useful for. It isn't programmatically called, nor referenced by any documentation that I could find. Since it doesn't set up requirements via make system-requirements reset-virtualenv test-requirements, I can't imagine that it works at all.
The purpose of the target is precisely to run tests without performing setup again. When doing development or debugging on unit tests where only the code is changing, not the requirements, it suffices to run test-docker once and then run test-docker-local on subsequent runs. It saves the time spent re-installing the requirements each time. Same motivation as many of the other "local" targets.
@@ -93,15 +100,20 @@ def main():
# Tell luigi what dependencies to pass to the Hadoop nodes:
# - edx.analytics.tasks is used to load the pipeline code, since we cannot trust all will be loaded automatically.
# - boto is used for all direct interactions with s3.
# - cjson is used for all parsing event logs.
# - filechunkio is used for multipart uploads of large files to s3.
Note that filechunkio is not used anymore, so it seemed a good time to remove it from here for starters.
@@ -119,6 +121,8 @@ def run_task_playbook(inventory, arguments, uid):
if arguments.workflow_profiler:
    env_vars['WORKFLOW_PROFILER'] = arguments.workflow_profiler
    env_vars['WORKFLOW_PROFILER_PATH'] = log_dir
if arguments.python_version:
    env_vars['HADOOP_PYTHON_EXECUTABLE'] = arguments.python_version
I don't really understand the purpose of this way of passing python_version. What is the environment variable being used for? Anything other than acceptance tests?
@@ -206,7 +208,12 @@ def setUp(self):
elasticsearch_alias = 'alias_test_' + self.identifier
self.warehouse_path = url_path_join(self.test_root, 'warehouse')
self.edx_rest_api_cache_root = url_path_join(self.test_src, 'edx-rest-api-cache')
# Use config directly, rather than os.getenv('HADOOP_PYTHON_EXECUTABLE', '/usr/bin/python')
python_executable = self.config.get('python_version', '/usr/bin/python')
I changed this code to not pull from the environment variable, because it was circular (at least on Jenkins). The executable is being passed into the acceptance tests explicitly, and needs to be loaded into the config file the tests use, so we just do that. The acceptance tests then call remote-task with the resulting config file (in a temp file location) as an arg.
@@ -146,7 +146,8 @@ def create_marker_table(self):
""".format(marker_schema=self.marker_schema, marker_table=self.marker_table)
)
except vertica_python.errors.QueryError as err:
if 'Sqlstate: 42710' in err.args[0]:  # This Sqlstate will appear if the marker table already exists.
# Sqlstate 42710 will appear if the marker table already exists.
if 'Sqlstate:' in err.args[0] and '42710' in err.args[0]:
The error message in Python 3 is slightly different: b'Object "table_updates" already exists', Sqlstate: b'42710', so this is a way of matching against either the Py2 or Py3 form.
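A small illustration of the relaxed check (the Python 2 string is assumed for contrast; only the Python 3 form is quoted above):

# Approximate error strings: the Python 3 form is the one quoted above,
# the Python 2 form is assumed for illustration.
py2_err = 'Object "table_updates" already exists, Sqlstate: 42710'
py3_err = 'b\'Object "table_updates" already exists\', Sqlstate: b\'42710\''

for err_text in (py2_err, py3_err):
    # The relaxed check matches the Sqlstate in either representation.
    print('Sqlstate:' in err_text and '42710' in err_text)  # True for both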
@@ -298,7 +299,7 @@ def insert_query(self):
-- the complete line item quantity and amount
IF(oi.status = 'refunded', oi.qty * oi.unit_cost, NULL) AS refunded_amount,
IF(oi.status = 'refunded', oi.qty, NULL) AS refunded_quantity,
oi.order_id AS payment_ref_id,
CAST(oi.order_id AS STRING) AS payment_ref_id,
This and the next CAST are required to work with the updated version of Hive on EMR 5.14 and EMR 5.24, among the versions tested (back in the day).
boto==2.48.0 # MIT
ecdsa==0.13 # MIT
Jinja2 # BSD
pycrypto==2.6.1 # public domain
wheel==0.30.0
future # MIT
Not a comment about this, but we might want to revisit the freezing of six==1.10.0 below. It may not be necessary any more.
python-gnupg==0.3.9 # BSD
pytz==2017.3 # ZPL
requests==2.18.4 # Apache 2.0
six==1.10.0 # MIT
six # MIT
Note that this is set to 1.10.0 in base.txt as well.
@@ -4,7 +4,7 @@
-r base.txt

argparse==1.2.1 # Python Software Foundation License
boto3==1.4.8 # Apache 2.0
boto3 # Apache 2.0
Reminder we should remove filechunkio.
Cybersource: Replaced deprecated servlets method with REST API.
This does not change any infrastructure to actually run anything under python 3; it only represents the results of my testing of the finance unit tests and acceptance tests under python 3, and of modernizing the code to become both python 2 and 3 compatible.
Roughly half of the changes were produced by python-modernize (and manually audited); the other half are manual changes.
Testing requirements
The following four suites of tests need to pass before merging this PR:
Analytics Pipeline Pull Request
Make sure that the following steps are done before merging: