This repository has been archived by the owner on May 1, 2024. It is now read-only.

Python 2/3 compatibility for finance tasks #747

Open — wants to merge 14 commits into master
Conversation

@pwnage101 (Contributor) commented Jul 17, 2019

This does not change any infrastructure to actually run anything under
python 3; it only represents the results of my testing of the finance
unit tests and acceptance tests under python 3, and of modernizing the code
to be compatible with both python 2 and 3.

Roughly half of the changes were produced by python-modernize (and manually audited); the other half are manual changes.


Testing requirements

The following four test suites need to pass before this PR is merged:

suite status
Python 2 unit tests Passing.
Python 2 acceptance tests Failed: build #1795
Python 3 unit tests (financial tasks only) Passing.
Python 3 acceptance tests (test_financial_reports only) In progress

Analytics Pipeline Pull Request

Make sure that the following steps are done before merging:

  • If you have a migration, please contact the data engineering team before merging.
  • Before merging, run the full acceptance test suite and provide the URL for the acceptance test run.
  • A member of the data engineering team has approved the pull request.


# When pip-compile is run under python 2, it omits all packages with a python 3
# condition. Re-add them here, pre-pinned.
ujson==1.35 ; python_version > "2.7"
Contributor Author

The correct thing to do, instead of using extra.txt, would be to compile two completely separate sets of requirements .txt files, one for python 2 and one for python 3. However, I consider that out of scope for this PR, and this works okay for now, if a little fragile.

Contributor

edx-platform has a similar file called constraints.txt. Is this that different?

Contributor Author

That is a different model. constraints.txt is fed into pip-compile, whereas our extra.txt is usually installed after default.txt by makefiles and other code in the production workflow. It's actually coincidental and lucky that we already have extra.txt, which makes it convenient for me to force-install some packages.

@@ -1,2 +1,2 @@
pip==9.0.1
pip==19.1.1
Contributor Author

Required for package install conditionals (e.g. python_version <= "2.7")

@@ -3,7 +3,7 @@
# Workaround for https://github.com/ansible/ansible/issues/8875
--no-binary ansible

ansible==1.4.5 # GPL v3 License
ansible<2.9.0 # GPL v3 License
Contributor Author

This is required for invoking ansible with python 3.

@pwnage101 (Contributor Author) commented Jul 17, 2019

Here's the current python 3 luigi execution summary:

20:40:00 ===== Luigi Execution Summary =====
20:40:00 
20:40:00 Scheduled 65 tasks of which:
20:40:00 * 3 present dependencies were encountered:
20:40:00     - 2 ExternalURL(url=s3+https://edx-analytics-test-data/credentials/acceptance-db.json,s3+https://edx-analytics-test-data/credentials/acceptance-vertica.json)
20:40:00     - 1 PaymentTask(...)
20:40:00 * 51 ran successfully:
20:40:00     - 1 FullOttoOrderTableTask(...)
20:40:00     - 1 FullShoppingcartOrderTableTask(...)
20:40:00     - 1 ImportAuthUserTask(...)
20:40:00     - 1 ImportCouponVoucherIndirectionState(...)
20:40:00     - 1 ImportCouponVoucherState(...)
20:40:00     ...
20:40:00 * 3 failed:
20:40:00     - 1 LoadInternalReportingFullShoppingcartOrdersToWarehouse(...)
20:40:00     - 1 OrderTableTask(...)
20:40:00     - 1 ReconcileFullOrdersAndTransactionsTask(...)
20:40:00 * 8 were left pending, among these:
20:40:00     * 6 had failed dependencies:
20:40:00         - 1 BuildEdServicesReportTask(...)
20:40:00         - 1 BuildFinancialReportsTask(...)
20:40:00         - 1 LoadInternalReportingEdServicesReportToWarehouse(...)
20:40:00         - 1 LoadInternalReportingFullOrderTransactionsToWarehouse(...)
20:40:00         - 1 ReconcileOrdersAndTransactionsTask(...)
20:40:00         ...
20:40:00     * 2 was not granted run permission by the scheduler:
20:40:00         - 1 LoadInternalReportingOrderTransactionsToWarehouse(...)
20:40:00         - 1 TransactionReportTask(...)

@brianhw (Contributor) left a comment

Most of the changes look benign. I'm mostly concerned about getting different results due to changes in the json parser -- or at least understanding what differences are introduced. Perhaps we don't have to run all production jobs against this branch, but I think that it would be good to check something run in python2 that uses events. And running finance-reports on python3, of course.

edx/analytics/tasks/util/eventlog.py (outdated; resolved)
if ujson:
return ujson.loads(obj)
elif cjson:
return cjson.decode(obj, all_unicode=True)
Contributor

Not all original calls specified all_unicode=True, especially the event loading. Would adding this option here have an impact on how we read events in Python2?

Contributor Author

all_unicode supposedly only affects output, not parsing. Also, even if all_unicode=False, the output may contain unicode objects anyway wherever there were unicode characters in the input JSON. The pipeline code necessarily already needed to be able to handle unicode strings in the decoded python object, so forcing the whole thing to be unicode should be low-risk.

https://github.com/AGProjects/python-cjson/blob/master/cjson.c#L1166-L1171

edx/analytics/tasks/util/tests/test_retry.py (resolved)
@@ -677,12 +678,20 @@ def format_transaction_table_output(self, audit_code, transaction, orderitem, tr
orderitem.partner_short_code if orderitem else self.default_partner_short_code,
orderitem.payment_ref_id if orderitem else transaction.payment_ref_id,
orderitem.order_id if orderitem else None,
encode_id(orderitem.order_processor, "order_id", orderitem.order_id) if orderitem else None,
encode_id(
Contributor

It might be good to confirm that the encode_id results match what is currently produced in tables. We should probably run the finance-report in dev and spot-check the actual output.

@@ -18,9 +18,10 @@ html5lib==1.0b3 # MIT
isoweek==1.3.3 # BSD
numpy==1.11.3 # BSD
paypalrestsdk==1.9.0 # Paypal SDK License
psycopg2==2.6.2 # LGPL
psycopg2 # LGPL
Contributor

Was this also required for python3? Not sure what this is actually used for -- Vertica at the very least?

Contributor Author

Oh, this was just so that I could run pip-compile on my local Linux machine; it probably would work as-is with a Mac python interpreter. I never really figured out why this is the case. Regardless, .in files generally shouldn't have such strict constraints, right?

idna==2.6 # via requests, snowflake-connector-python
ijson==2.3 # via snowflake-connector-python
ijson==2.4 # via snowflake-connector-python
Contributor

What is ijson? If we're now pulling this in, can we use it instead of cjson? Or is it just not fast enough?



Provide an abstraction layer for fast json implementations across python 2 and 3.
"""
try:
import ujson
Contributor

Curious also about the choice of ujson. I came across an article (https://pythonspeed.com/articles/faster-json-library/) that noted in 4/2019 that "ujson has a number of bugs filed regarding crashes, and even those crashes that have been fixed aren’t always available because there hasn’t been a release since 2016." That particular article considered instead rapidjson and orjson as candidates, with a use case of encoding relatively short messages. Our application is the opposite, to decode relatively short messages (avg length 1K), so we might have a different result.

Hmm. Looking at https://pypi.org/project/orjson/, it says that while its encoding speeds are faster than others, the decoding speed is about the same to 2x the standard. And it doesn't mention Python2 support (though why would anyone anymore?). Rapidjson is apparently another C library with Python bindings, and its performance is documented here: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html. It suggests that for our deserialization task, all the packages are probably comparable, except in the case of significant Unicode strings, where simplejson and stdlib json both slow way down. So those results suggest that ujson is a fine enough choice.

@pwnage101 (Contributor Author) commented Jul 23, 2019

I wasn't originally intending to implement something like FastJson, but the following two things compelled me to write some basic shell of an implementation:

  1. Luigi, via stevedore plugins, kept importing all the python files in the repo, even though it didn't actually need any tasks beyond those required for financial reporting.
  2. cjson was uninstallable in a python 3 virtualenv, and possibly also vice versa (ujson + python 2)

We don't need to evaluate the json part of this PR any further if it does not block the proper functioning of the financial reporting tasks.
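The fallback described above can be sketched as a small abstraction module. This is a minimal sketch under the assumptions in this thread (ujson available on Python 3, python-cjson on Python 2), not the PR's actual implementation; the stdlib `json` fallback is added here for illustration.

```python
"""Sketch of a fast-JSON abstraction layer for Python 2/3 (illustrative)."""
try:
    import ujson

    def loads(s):
        # ujson always returns unicode strings.
        return ujson.loads(s)
except ImportError:
    try:
        import cjson

        def loads(s):
            # all_unicode=True forces unicode output, matching ujson's
            # behavior on Python 3.
            return cjson.decode(s, all_unicode=True)
    except ImportError:
        import json

        def loads(s):
            # Slow but always-available stdlib fallback.
            return json.loads(s)
```

Callers then use `loads()` without caring which backend was importable, which also avoids the stevedore-driven import failures mentioned in point 1.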

@pwnage101 pwnage101 force-pushed the pwnage101/de-1589-py3-in-finance-tasks branch 2 times, most recently from 6193150 to e1b528a Compare July 26, 2019 13:48
This does not change any infrastructure to actually run anything under
python 3, it only represents the results of my testing of the finance
unit tests and acceptance tests under python 3 and modernizing the code
to become both python 2 and 3 compatible.
@pwnage101 pwnage101 force-pushed the pwnage101/de-1589-py3-in-finance-tasks branch 2 times, most recently from daff797 to 3daf46d Compare July 26, 2019 14:57
@pwnage101 pwnage101 force-pushed the pwnage101/de-1589-py3-in-finance-tasks branch from 3daf46d to 8d5f492 Compare July 26, 2019 15:00
brianhw and others added 4 commits July 26, 2019 12:31
Change default for now to python 3.6, for testing.

Also cast INT to STRING before UNION in Hive code.

Pass 'future' package to all remote nodes.

Fix creation of Vertica marker tables in Python 3.

Convert OrderTransactionRecord from bytes.
@@ -56,20 +56,48 @@ upgrade: ## update the requirements/*.txt files with the latest packages satisfy
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/docs.txt requirements/docs.in
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/test.txt requirements/test.in

test-docker-local:
docker run --rm -u root -v `(pwd)`:/edx/app/analytics_pipeline/analytics_pipeline -it edxops/analytics_pipeline:latest make develop-local test-local

Contributor Author

FYI, I deleted this make target because I couldn't figure out what it's useful for. It isn't called programmatically, nor referenced by any documentation that I could find. Since it doesn't set up requirements via make system-requirements reset-virtualenv test-requirements, I can't imagine that it works at all.

Contributor

The purpose of the target is precisely to run tests without performing setup again. When doing development or debugging on unit tests where only the code is changing, not the requirements, it suffices to run test-docker once and then run test-docker-local on subsequent runs. It saves the time spent re-installing the requirements each time. The same motivation applies to many of the other "local" targets.

@@ -93,15 +100,20 @@ def main():
# Tell luigi what dependencies to pass to the Hadoop nodes:
# - edx.analytics.tasks is used to load the pipeline code, since we cannot trust all will be loaded automatically.
# - boto is used for all direct interactions with s3.
# - cjson is used for all parsing event logs.
# - filechunkio is used for multipart uploads of large files to s3.
Contributor

Note that filechunkio is not used anymore, so it seemed a good time to remove it from here for starters.

@@ -119,6 +121,8 @@ def run_task_playbook(inventory, arguments, uid):
if arguments.workflow_profiler:
env_vars['WORKFLOW_PROFILER'] = arguments.workflow_profiler
env_vars['WORKFLOW_PROFILER_PATH'] = log_dir
if arguments.python_version:
env_vars['HADOOP_PYTHON_EXECUTABLE'] = arguments.python_version
Contributor

I don't really understand the purpose of this way of passing python_version. What is the environment variable being used for? Anything other than acceptance tests?

@@ -206,7 +208,12 @@ def setUp(self):
elasticsearch_alias = 'alias_test_' + self.identifier
self.warehouse_path = url_path_join(self.test_root, 'warehouse')
self.edx_rest_api_cache_root = url_path_join(self.test_src, 'edx-rest-api-cache')
# Use config directly, rather than os.getenv('HADOOP_PYTHON_EXECUTABLE', '/usr/bin/python')
python_executable = self.config.get('python_version', '/usr/bin/python')
Contributor

I changed this code to not pull from the environment variable, because it was circular (at least on Jenkins). The executable is being passed into the acceptance tests explicitly, and needs to be loaded into the config file the tests use, so we just do that. The acceptance tests then call remote-task with the resulting config file (in a temp file location) as an arg.
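The non-circular lookup described above can be sketched as a plain config read with a default. This is illustrative only: `python_version` is the config key assumed from the diff, and `config` stands in for the acceptance tests' config object.

```python
# Sketch: resolve the python executable for acceptance tests from the
# config file, not from HADOOP_PYTHON_EXECUTABLE, so there is no circular
# dependency on an environment variable the tests themselves are meant
# to set. 'python_version' is the key assumed from the diff.
DEFAULT_PYTHON = '/usr/bin/python'

def resolve_python_executable(config):
    # config is a dict-like view of the acceptance-test config file.
    return config.get('python_version', DEFAULT_PYTHON)
```

Under this scheme, whatever executable Jenkins passes in explicitly is written into the temp config file, and remote-task simply reads it back.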

@@ -146,7 +146,8 @@ def create_marker_table(self):
""".format(marker_schema=self.marker_schema, marker_table=self.marker_table)
)
except vertica_python.errors.QueryError as err:
if 'Sqlstate: 42710' in err.args[0]: # This Sqlstate will appear if the marker table already exists.
# Sqlstate 42710 will appear if the marker table already exists.
if 'Sqlstate:' in err.args[0] and '42710' in err.args[0]:
Contributor

The error message in Python 3 is slightly different: b'Object "table_updates" already exists', Sqlstate: b'42710'. So this is a way of matching either the Py2 or the Py3 form.
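The two message forms differ only in bytes-repr quoting, which is why the match is split into two substring checks. A small illustration (the error strings below are representative, not captured verbatim from Vertica):

```python
# Representative Py2- and Py3-style error strings: in Python 3 the driver
# interpolates bytes objects, so their reprs (b'...') appear in the message.
PY2_MSG = 'Object "table_updates" already exists, Sqlstate: 42710'
PY3_MSG = 'b\'Object "table_updates" already exists\', Sqlstate: b\'42710\''

def marker_table_exists(message):
    # Check the Sqlstate label and the code separately, so the quoting
    # between them doesn't matter.
    return 'Sqlstate:' in message and '42710' in message
```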

@@ -298,7 +299,7 @@ def insert_query(self):
-- the complete line item quantity and amount
IF(oi.status = 'refunded', oi.qty * oi.unit_cost, NULL) AS refunded_amount,
IF(oi.status = 'refunded', oi.qty, NULL) AS refunded_quantity,
oi.order_id AS payment_ref_id,
CAST(oi.order_id AS STRING) AS payment_ref_id,
Contributor

This and the next CAST are required to work with the updated version of Hive on EMR 5.14 and EMR 5.24, among the versions tested (back in the day).

boto==2.48.0 # MIT
ecdsa==0.13 # MIT
Jinja2 # BSD
pycrypto==2.6.1 # public domain
wheel==0.30.0
future # MIT
Contributor

Not a comment about this, but we might want to revisit the freezing of six==1.10.0 below. It may not be necessary any more.

python-gnupg==0.3.9 # BSD
pytz==2017.3 # ZPL
requests==2.18.4 # Apache 2.0
six==1.10.0 # MIT
six # MIT
Contributor

Note that this is set to 1.10.0 in base.txt as well.

@@ -4,7 +4,7 @@
-r base.txt

argparse==1.2.1 # Python Software Foundation License
boto3==1.4.8 # Apache 2.0
boto3 # Apache 2.0
Contributor

Reminder we should remove filechunkio.

@brianhw brianhw force-pushed the pwnage101/de-1589-py3-in-finance-tasks branch from c82f9f5 to abd2091 Compare July 29, 2019 21:29
3 participants