Python 2/3 compatibility for finance tasks #747
base: master
Conversation
# When pip-compile is run under python 2, it omits all packages with a python 3
# condition. Re-add them here, pre-pinned.
ujson==1.35 ; python_version > "2.7"
The correct thing to do, instead of using extra.txt, is to compile two completely separate sets of requirements .txt files, one for python 2 and one for python 3. However, I will consider that out of scope for this PR, and this works okay for now, if a little fragile.
edx-platform has a similar file called constraints.txt. Is this that different?
That is a different model. constraints.txt is fed into pip-compile, whereas our extra.txt is usually installed after default.txt by makefiles and other code in the production workflow. It's actually coincidental and lucky that we already have extra.txt, which makes it convenient for me to force-install some packages.
@@ -1,2 +1,2 @@
pip==9.0.1
pip==19.1.1
Required for package install conditionals (e.g. python_version <= "2.7").
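As an aside, the environment markers that newer pip understands can also be evaluated programmatically; here is a minimal illustration using the packaging library (not something this PR uses, just for context):

# Minimal illustration (not part of this PR): evaluating an environment marker
# like the one in the requirements file against the running interpreter.
from packaging.markers import Marker

marker = Marker('python_version > "2.7"')
print(marker.evaluate())  # True under python 3, False under python 2.7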
@@ -3,7 +3,7 @@
# Workaround for https://github.com/ansible/ansible/issues/8875
--no-binary ansible

ansible==1.4.5 # GPL v3 License
ansible<2.9.0 # GPL v3 License
This is required for invoking ansible with python 3.
Here's the current python 3 luigi execution summary:
Most of the changes look benign. I'm mostly concerned about getting different results due to changes in the json parser -- or at least understanding what differences are introduced. Perhaps we don't have to run all production jobs against this branch, but I think that it would be good to check something run in python2 that uses events. And running finance-reports on python3, of course.
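One cheap way to get a handle on those differences (a sketch only; the file name and choice of parsers are illustrative) would be to diff the two parsers over a sample of event lines:

# Sketch: compare stdlib json and ujson over a sample of event lines and
# report any line where the decoded objects differ. File name is illustrative.
import json
import ujson

with open('sample_events.log') as events:
    for line in events:
        if json.loads(line) != ujson.loads(line):
            print('parsers disagree on:', line)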
if ujson:
    return ujson.loads(obj)
elif cjson:
    return cjson.decode(obj, all_unicode=True)
Not all original calls specified all_unicode=True, especially the event loading. Would adding this option here have an impact on how we read events in Python2?
all_unicode supposedly only affects output, not parsing. Also, even if all_unicode=False, the output may contain unicode objects anyway wherever there were unicode characters in the input JSON. The pipeline code necessarily already needed to be able to handle unicode strings in the decoded python object, so forcing the whole thing to be unicode should be low-risk.
https://github.com/AGProjects/python-cjson/blob/master/cjson.c#L1166-L1171
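A small illustration of that claim (Python 2 with python-cjson; behavior inferred from the linked source, so treat it as an assumption):

# With cjson, all_unicode only controls whether pure-ASCII strings come back
# as str or unicode; strings with non-ASCII content are unicode either way.
import cjson  # python-cjson, Python 2 only

print(type(cjson.decode('{"a": "b"}')['a']))                    # str
print(type(cjson.decode('{"a": "b"}', all_unicode=True)['a']))  # unicode
print(type(cjson.decode('{"a": "caf\\u00e9"}')['a']))           # unicode either way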
@@ -677,12 +678,20 @@ def format_transaction_table_output(self, audit_code, transaction, orderitem, tr
orderitem.partner_short_code if orderitem else self.default_partner_short_code,
orderitem.payment_ref_id if orderitem else transaction.payment_ref_id,
orderitem.order_id if orderitem else None,
encode_id(orderitem.order_processor, "order_id", orderitem.order_id) if orderitem else None,
encode_id(
It might be good to confirm that the encode_id results match what is currently produced in tables. We should probably run the finance-report in dev and spot-check the actual output.
@@ -18,9 +18,10 @@ html5lib==1.0b3 # MIT
isoweek==1.3.3 # BSD
numpy==1.11.3 # BSD
paypalrestsdk==1.9.0 # Paypal SDK License
psycopg2==2.6.2 # LGPL
psycopg2 # LGPL
Was this also required for python3? Not sure what this is actually used for -- Vertica at the very least?
Oh, this was just so that I can run pip-compile on my local Linux machine; it would probably work as-is with a Mac python interpreter. I never really figured out why that is the case. Regardless, .in files generally shouldn't have such strict constraints, right?
idna==2.6 # via requests, snowflake-connector-python
ijson==2.3 # via snowflake-connector-python
ijson==2.4 # via snowflake-connector-python
What is ijson? If we're now pulling this in, can we use it instead of cjson? Or is it just not fast enough?
Provide an abstraction layer for fast json implementations across python 2 and 3.
"""
try:
    import ujson
Curious also about the choice of ujson. I came across an article (https://pythonspeed.com/articles/faster-json-library/) that noted in 4/2019 that "ujson has a number of bugs filed regarding crashes, and even those crashes that have been fixed aren’t always available because there hasn’t been a release since 2016." That particular article considered instead rapidjson and orjson as candidates, with a use case of encoding relatively short messages. Our application is the opposite, to decode relatively short messages (avg length 1K), so we might have a different result.
Hmm. Looking at https://pypi.org/project/orjson/, it says that while its encoding speeds are faster than the others', its decoding speed is roughly the same as to 2x that of the standard library. And it doesn't mention Python 2 support (though why would anyone anymore?). Rapidjson is apparently another C library with Python bindings, and its performance is documented here: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html. It suggests that for our deserialization task, all the packages are probably comparable, except in the case of significant Unicode strings, where simplejson and stdlib json both slow way down. So those results suggest that ujson is a fine enough choice.
I wasn't originally intending to implement something like FastJson, but the following two things compelled me to write a basic shell of an implementation:
- Luigi, via stevedore plugins, kept importing all the python files in the repo, even though none of those tasks are needed for financial reporting.
- cjson was uninstallable in a python 3 virtualenv, and possibly also vice versa (ujson + python 2).
We don't need to evaluate the json part of this PR any further if it does not block the proper functioning of the financial reporting tasks.
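For reference, here is a minimal sketch of the kind of abstraction layer described above (module layout and names are illustrative, not necessarily identical to what this PR adds):

"""
Provide an abstraction layer for fast json implementations across python 2 and 3.
"""
import json  # stdlib fallback, always available

try:
    import ujson  # fast C parser, installable under python 3
except ImportError:
    ujson = None

try:
    import cjson  # python 2 only; not installable in a python 3 virtualenv
except ImportError:
    cjson = None


def loads(obj):
    """Decode a JSON string with the fastest parser that imported successfully."""
    if ujson:
        return ujson.loads(obj)
    elif cjson:
        return cjson.decode(obj, all_unicode=True)
    return json.loads(obj)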
Change default for now to python 3.6, for testing. Also cast INT to STRING before UNION in Hive code. Pass 'future' package to all remote nodes. Fix creation of Vertica marker tables in Python 3. Convert OrderTransactionRecord from bytes.
@@ -56,20 +56,48 @@ upgrade: ## update the requirements/*.txt files with the latest packages satisfy
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/docs.txt requirements/docs.in
CUSTOM_COMPILE_COMMAND="make upgrade" pip-compile --upgrade -o requirements/test.txt requirements/test.in

test-docker-local:
docker run --rm -u root -v `(pwd)`:/edx/app/analytics_pipeline/analytics_pipeline -it edxops/analytics_pipeline:latest make develop-local test-local
FYI, I deleted this make target because I couldn't figure out what it's useful for. It isn't programmatically called, nor referenced by any documentation that I could find. Since it doesn't set up requirements via make system-requirements reset-virtualenv test-requirements, I can't imagine that it works at all.
The purpose of the target is precisely to run tests without performing setup again. When doing development or debugging on unit tests where only the code is changing, not the requirements, it suffices to run test-docker once and then run test-docker-local on subsequent runs. It saves the time spent re-installing the requirements each time. Same motivation as many of the other "local" targets.
@@ -93,15 +100,20 @@ def main():
# Tell luigi what dependencies to pass to the Hadoop nodes:
# - edx.analytics.tasks is used to load the pipeline code, since we cannot trust all will be loaded automatically.
# - boto is used for all direct interactions with s3.
# - cjson is used for all parsing event logs.
# - filechunkio is used for multipart uploads of large files to s3.
Note that filechunkio is not used anymore, so it seemed a good time to remove it from here for starters.
@@ -119,6 +121,8 @@ def run_task_playbook(inventory, arguments, uid):
if arguments.workflow_profiler:
    env_vars['WORKFLOW_PROFILER'] = arguments.workflow_profiler
    env_vars['WORKFLOW_PROFILER_PATH'] = log_dir
if arguments.python_version:
    env_vars['HADOOP_PYTHON_EXECUTABLE'] = arguments.python_version
I don't really understand the purpose of this way of passing python_version. What is the environment variable being used for? Anything other than acceptance tests?
@@ -206,7 +208,12 @@ def setUp(self):
elasticsearch_alias = 'alias_test_' + self.identifier
self.warehouse_path = url_path_join(self.test_root, 'warehouse')
self.edx_rest_api_cache_root = url_path_join(self.test_src, 'edx-rest-api-cache')
# Use config directly, rather than os.getenv('HADOOP_PYTHON_EXECUTABLE', '/usr/bin/python')
python_executable = self.config.get('python_version', '/usr/bin/python')
I changed this code to not pull from the environment variable, because it was circular (at least on Jenkins). The executable is being passed into the acceptance tests explicitly, and needs to be loaded into the config file the tests use, so we just do that. The acceptance tests then call remote-task with the resulting config file (in a temp file location) as an arg.
@@ -146,7 +146,8 @@ def create_marker_table(self):
""".format(marker_schema=self.marker_schema, marker_table=self.marker_table)
)
except vertica_python.errors.QueryError as err:
if 'Sqlstate: 42710' in err.args[0]:  # This Sqlstate will appear if the marker table already exists.
# Sqlstate 42710 will appear if the marker table already exists.
if 'Sqlstate:' in err.args[0] and '42710' in err.args[0]:
The error message in Python 3 is slightly different: b'Object "table_updates" already exists', Sqlstate: b'42710', so this is a way of matching against either the Py2 or Py3 form.
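A small illustration of the relaxed check (the Python 2 string is assumed for contrast; only the Python 3 form is quoted above):

# Approximate error strings: the Python 3 form is the one quoted above,
# the Python 2 form is assumed for illustration.
py2_err = 'Object "table_updates" already exists, Sqlstate: 42710'
py3_err = 'b\'Object "table_updates" already exists\', Sqlstate: b\'42710\''

for err_text in (py2_err, py3_err):
    # The relaxed check matches the Sqlstate in either representation.
    print('Sqlstate:' in err_text and '42710' in err_text)  # True for both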
@@ -298,7 +299,7 @@ def insert_query(self):
-- the complete line item quantity and amount
IF(oi.status = 'refunded', oi.qty * oi.unit_cost, NULL) AS refunded_amount,
IF(oi.status = 'refunded', oi.qty, NULL) AS refunded_quantity,
oi.order_id AS payment_ref_id,
CAST(oi.order_id AS STRING) AS payment_ref_id,
This and the next CAST are required to work with the updated version of Hive on EMR 5.14 and EMR 5.24, among the versions tested (back in the day).
boto==2.48.0 # MIT
ecdsa==0.13 # MIT
Jinja2 # BSD
pycrypto==2.6.1 # public domain
wheel==0.30.0
future # MIT
Not a comment about this, but we might want to revisit the freezing of six==1.10.0 below. It may not be necessary any more.
python-gnupg==0.3.9 # BSD
pytz==2017.3 # ZPL
requests==2.18.4 # Apache 2.0
six==1.10.0 # MIT
six # MIT
Note that this is set to 1.10.0 in base.txt as well.
@@ -4,7 +4,7 @@
-r base.txt

argparse==1.2.1 # Python Software Foundation License
boto3==1.4.8 # Apache 2.0
boto3 # Apache 2.0
Reminder we should remove filechunkio.
Cybersource: Replaced deprecated servlets method with REST API.
This does not change any infrastructure to actually run anything under python 3; it only represents the results of my testing of the finance unit tests and acceptance tests under python 3, and of modernizing the code to become both python 2 and 3 compatible.
Roughly half of the changes were produced by python-modernize (and manually audited); the other half are manual changes.
Testing requirements
The following four suites of tests need to pass before merging this PR:
Analytics Pipeline Pull Request
Make sure that the following steps are done before merging: