WIP-Spark Tasks #476
base: master
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #476      +/-  ##
==========================================
- Coverage   77.68%    76.79%    -0.9%
==========================================
  Files         193       196       +3
  Lines       21344     22004     +660
==========================================
+ Hits        16582     16898     +316
- Misses       4762      5106     +344
Continue to review full report at Codecov.
Force-pushed from 4ce2ed2 to b89c71d
Force-pushed from 9dba8a9 to f65be75
Force-pushed from cc759f3 to 7f1b57d
Force-pushed from dfab4d1 to 7d6b0c0
Force-pushed from 9def6c5 to f12ec82
Force-pushed from 894f1b5 to fb527a4
Force-pushed from 07343a0 to ae0277f
Force-pushed from 2ab865d to 7c1d21f
Force-pushed from fe100b0 to 5fff726
Force-pushed from 6309079 to 2f1ec9c
Force-pushed from e75c441 to 744684e
Force-pushed from c40aded to e204e49
Force-pushed from 50be7dc to 86fd011
Good progress! Sorry to raise questions about marker files, but it's a challenge in general in any schema to figure out what has been done and what needs to be done.
FROM (
    SELECT
        event_date as dt, user_id, course_id, timestamp, ip,
        ROW_NUMBER() over ( PARTITION BY event_date, user_id, course_id ORDER BY timestamp desc) as rank
nit: for consistency, I would uppercase all keywords, including OVER and AS (here and the line above).
    WHERE rank = 1
"""
result = self._spark.sql(query)
result.coalesce(4).write.partitionBy('dt').csv(self.output_dir().path, mode='append', sep='\t')
It might be good to add a comment here, that this will produce four TSV files in the dt= subdirectory under the output_dir().
But a question: is there a reason why the mode is 'append'?
And a second question: I've noticed that write() generally creates a _SUCCESS file separate from the actual output files but in the same directory. Will this use of partitionBy() do the same, and if so, where would it write this file? In each partition, or in the output_dir()?
It will write to the output_dir() (not the individual partition).
With overwrite mode it won't keep other partitions, i.e. partitions that aren't in this interval would be deleted from the output_dir() path. So I'm using append mode and explicitly removing the partitions that will be written by the job, to avoid duplicates.
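To make this concrete, here is a rough sketch of the query-and-write step with keywords uppercased per the earlier nit and with comments answering the questions above. The source view name `events` and the outer SELECT are assumptions filled in around the snippet under review, not the actual code.

```python
# Sketch only: 'events' and the surrounding query shape are assumed.
query = """
    SELECT dt, user_id, course_id, timestamp, ip
    FROM (
        SELECT
            event_date AS dt, user_id, course_id, timestamp, ip,
            ROW_NUMBER() OVER (PARTITION BY event_date, user_id, course_id ORDER BY timestamp DESC) AS rank
        FROM events
    ) deduped
    WHERE rank = 1
"""
result = self._spark.sql(query)

# coalesce(4) means each dt=<date> subdirectory under output_dir() ends up with at
# most four TSV part files. The _SUCCESS marker is written once, directly in
# output_dir(), not inside each partition. Append mode preserves partitions that
# are outside this interval; run() removes the partitions this job is about to
# rewrite so that appending does not duplicate their rows.
result.coalesce(4).write.partitionBy('dt').csv(self.output_dir().path, mode='append', sep='\t')
```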
class LastDailyIpAddressOfUserTaskSpark(EventLogSelectionMixinSpark, WarehouseMixin, SparkJobTask):
    """Spark alternate of LastDailyIpAddressOfUserTask"""

    output_parent_dir = 'last_ip_of_user_id'
nit: how about "output_parent_dirname"? When I first saw this below, I was unclear whether it was a name or a path.
""" | ||
Output directory for spark task | ||
""" | ||
return get_target_from_url( |
Is there a reason this is a target instead of just a URL? Is that the easiest way to get .exists() functionality?
Yes, it is for .exists() functionality.
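For illustration, a minimal sketch of an output_dir() along those lines, assuming the get_target_from_url and url_path_join helpers from edx.analytics.tasks.util.url and a warehouse_path attribute supplied by WarehouseMixin; the exact path layout here is an assumption.

```python
from edx.analytics.tasks.util.url import get_target_from_url, url_path_join

def output_dir(self):
    """
    Output directory for spark task.

    Returned as a target rather than a plain URL so that callers can use
    .exists() and .remove() on it, as discussed above.
    """
    # Sketch only: the exact path components are assumed.
    return get_target_from_url(
        url_path_join(self.warehouse_path, self.output_parent_dir)
    )
```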
for target in self.output_paths():
    if target.exists():
        target.remove()
super(LastDailyIpAddressOfUserTaskSpark, self).run()
In the original, each date in the interval is checked to see if it produces actual data. If the data is sparse enough, then not all date partitions will be created. This was done because the downstream class needed to have all dates exist, even if empty, so as to know what dates had been processed. This was also dealt with using the downstream_input_tasks() to do this check in the LastCountryOfUser task.
But maybe this can be addressed when the follow-on (i.e. non-historic) workflow is written.
My approach is to handle it with the non-historic task, and if that isn't possible then I'll make adjustments here.
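If the non-historic task does end up needing every date partition to exist, one possible approach is to materialize empty partitions after the write. A hypothetical sketch follows; the placeholder file name, helpers, and the assumption that self.interval is a luigi date interval are all illustrative, not the actual implementation.

```python
from edx.analytics.tasks.util.url import get_target_from_url, url_path_join

# Ensure a dt= partition exists for every date in the interval, even when Spark
# wrote no output for that date, so downstream tasks can tell the date was processed.
for date in self.interval.dates():
    placeholder = get_target_from_url(
        url_path_join(self.output_dir().path, 'dt=' + date.isoformat(), '_EMPTY')
    )
    if not placeholder.exists():
        # Touch a zero-length file; this materializes the otherwise-empty partition directory.
        with placeholder.open('w'):
            pass
```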
2) create marker in separate dir just like hadoop multi-mapreduce task
The former approach can fail when there is no data for some dates, as spark will not generate empty
partitions.
The latter approach is more consistent.
I see. So this is how you're doing user location. You chose approach 2 over approach 1. That may be fine, though I'm wary about using marker files where we can avoid it, because they are magical and hard to track down. It is really hard to know why a job is not producing output when a marker file exists somewhere with an opaque hash for a name. Also, because the hash comes from hashing the task, which effectively hashes the (significant) parameters, one will get a different hash for each interval. So if I run different but overlapping intervals, it will do an overwrite, but it doesn't actually check that the intervals that have been run are contiguous, or overlapping, or whatever. They only check that the same exact interval has or hasn't been run before.
That's why the location workflow verified partitions for all dates in the interval, and why the run() method created empty partitions for dates that had no data after the multi-mapreduce was done. (That's not just an issue with spark but an issue with multi-mapreduce in general, whether spark or map-reduce.)
But perhaps there are other reasons to go with the marker over the partition, like performance. I'm just wary about whether they're worth the complexity for the user.
Spark doesn't create empty partitions when writing output with partitionBy() if there is no data for a date. In such cases, checking for partition existence will fail, so I opted for the marker approach.
I'll echo Brian's thoughts on using an obscurely named marker file to indicate whether a job has succeeded in the past. We've obviously done it before, and in this specific example we've done it. I'd like to challenge engineers to see if we can come up with a better method, even if all we find out is that our current implementation is the cleanest option. In this case I'm fine with continuing our past behavior.
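To make the trade-off concrete, a rough illustration of the marker approach being discussed; the class and method names here are illustrative, not the actual implementation. The marker name is derived from the task's significant parameters, which is why two overlapping-but-different intervals produce different markers and neither run detects the other.

```python
import hashlib

import luigi

from edx.analytics.tasks.util.url import get_target_from_url, url_path_join


class MarkerFileMixin(object):
    """Record completion with a file in a separate marker directory, keyed by a task hash."""

    marker = luigi.Parameter(description='URL directory where marker files are written.')

    def marker_target(self):
        # The opaque hash: it changes whenever any significant parameter (e.g. the
        # interval) changes, so only an exact re-run of the same interval is detected.
        task_hash = hashlib.sha1(self.task_id.encode('utf-8')).hexdigest()
        return get_target_from_url(url_path_join(self.marker, task_hash))

    def complete(self):
        # Complete if (and only if) this exact parameter combination has run before.
        return self.marker_target().exists()
```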
def on_success(self):  # pragma: no cover
    """Overload the success method to touch the _SUCCESS file. Any class that uses a separate Marker file from the
    data file will need to override the base on_success() call to create this marker."""
nit: make docstring multiline....
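e.g., the same text split across lines, with the body elided:

```python
def on_success(self):  # pragma: no cover
    """
    Overload the success method to touch the _SUCCESS file.

    Any class that uses a separate Marker file from the data file will need to
    override the base on_success() call to create this marker.
    """
    # ... existing body unchanged ...
```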
if self.output_dir().exists():  # only check partitions if parent dir exists
    for target in self.output_paths():
        if target.exists():
            target.remove()
It might be good to put in a comment here, because I'm not remembering if we did this because we were writing in append mode, or because of issues if we reduced the number of partitions we coalesced down to on output, or because Spark really doesn't want the files around when it's planning on writing them out. Do you remember?
This is done to avoid duplicates when writing in append mode.
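Something along these lines would capture that reasoning in the code; the comment wording is just a suggestion.

```python
def run(self):
    # The writer uses mode='append' so partitions outside this interval are
    # preserved; remove any partitions this job is about to rewrite, otherwise
    # appending would duplicate their rows.
    if self.output_dir().exists():  # only check partitions if parent dir exists
        for target in self.output_paths():
            if target.exists():
                target.remove()
    super(LastDailyIpAddressOfUserTaskSpark, self).run()
```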
Force-pushed from 86fd011 to ceda01d
Force-pushed from ceda01d to 64308be
I left a few comments as I read the code. If I remember correctly, the changes in the spark base PR were getting merged and these were being left unmerged. I could be wrong. The comments I wrote are all forward-looking to future PRs. If we aren't merging this guy, no additional work is necessary.
def user_activity_hive_table_path(self, *args):
I think as we convert jobs we'll not want to reference the word "hive". It definitely makes sense now when you are interleaving Spark components. But in the future we'll have no hive components and names like this will be confusing. Again though, no change in this PR, just for future reference.
Right, I'll keep it in mind.
def get_event_time_string(event_time):
    """Returns the time of the event as an ISO8601 formatted string."""
    try:
        # Get entry, and strip off time zone information.  Keep microseconds, if any.
Dah, time string processing is a hobgoblin. I suspect all of our event times are converted to UTC. We may want to confirm that and not drop the timezone field.
It is exactly the same as the one we use for hadoop.
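For reference, the shared helper looks roughly like this; it is reconstructed from the comment above and the exact code in the repo may differ. Note that splitting on '+' only strips a '+HH:MM' suffix; a 'Z' or negative offset would pass through unchanged, which relates to the UTC question above.

```python
def get_event_time_string(event_time):
    """Returns the time of the event as an ISO8601 formatted string."""
    try:
        # Get entry, and strip off time zone information.  Keep microseconds, if any.
        timestamp = event_time.split('+')[0]
        if '.' not in timestamp:
            timestamp = '{datetime}.000000'.format(datetime=timestamp)
        return timestamp
    except Exception:  # pylint: disable=broad-except
        return None
```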
This PR includes the following tasks:
TotalEventsDailyTask
UserActivityTask
CourseActivityPartitionTask
InternalReportingUserActivityPartitionTask
Note: not to be merged into the master branch yet.