WIP-Spark Tasks #476

Open · wants to merge 14 commits into master from mannan/spark_tasks
Conversation

rao-abdul-mannan (Contributor) commented Jan 24, 2018

This PR includes the following tasks:

TotalEventsDailyTask
UserActivityTask
CourseActivityPartitionTask
InternalReportingUserActivityPartitionTask

Note: Not to be merged into the master branch yet.

codecov bot commented Jan 24, 2018

Codecov Report

Merging #476 into master will decrease coverage by 0.89%.
The diff coverage is 31.16%.


@@            Coverage Diff            @@
##           master     #476     +/-   ##
=========================================
- Coverage   77.68%   76.79%   -0.9%     
=========================================
  Files         193      196      +3     
  Lines       21344    22004    +660     
=========================================
+ Hits        16582    16898    +316     
- Misses       4762     5106    +344
Impacted Files Coverage Δ
edx/analytics/tasks/launchers/remote.py 0% <0%> (ø) ⬆️
edx/analytics/tasks/util/manifest.py 98.03% <100%> (+0.03%) ⬆️
...alytics/tasks/insights/tests/test_user_activity.py 100% <100%> (ø) ⬆️
edx/analytics/tasks/util/constants.py 100% <100%> (ø)
edx/analytics/tasks/util/spark_util.py 17.8% <17.8%> (ø)
edx/analytics/tasks/common/spark.py 26.12% <26.12%> (ø)
...dx/analytics/tasks/insights/location_per_course.py 79.5% <30%> (-6.95%) ⬇️
edx/analytics/tasks/insights/user_activity.py 57.35% <34.14%> (-16.34%) ⬇️
edx/analytics/tasks/monitor/overall_events.py 77.41% <50%> (-17.32%) ⬇️
edx/analytics/tasks/monitor/total_events_report.py 94.59% <75%> (+0.15%) ⬆️
... and 9 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5e5e34...e5d3e87.

rao-abdul-mannan force-pushed the mannan/spark_tasks branch 2 times, most recently from 2ab865d to 7c1d21f on May 28, 2018
rao-abdul-mannan force-pushed the mannan/spark_tasks branch 2 times, most recently from fe100b0 to 5fff726 on June 1, 2018
rao-abdul-mannan changed the title from "Mannan/spark tasks" to "WIP-Spark Tasks" on June 1, 2018
rao-abdul-mannan force-pushed the mannan/spark_tasks branch 9 times, most recently from 6309079 to 2f1ec9c on June 6, 2018
rao-abdul-mannan force-pushed the mannan/spark_tasks branch 3 times, most recently from 50be7dc to 86fd011 on June 12, 2018
brianhw (Contributor) left a comment

Good progress! Sorry to raise questions about marker files, but it's a challenge in general in any schema to figure out what has been done and what needs to be done.

FROM (
SELECT
event_date as dt, user_id, course_id, timestamp, ip,
ROW_NUMBER() over ( PARTITION BY event_date, user_id, course_id ORDER BY timestamp desc) as rank
Contributor

nit: for consistency, I would uppercase all keywords, including OVER and AS (here and the line above).
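
For reference, the fragment with all keywords uppercased would read:

FROM (
SELECT
event_date AS dt, user_id, course_id, timestamp, ip,
ROW_NUMBER() OVER ( PARTITION BY event_date, user_id, course_id ORDER BY timestamp DESC) AS rank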

WHERE rank = 1
"""
result = self._spark.sql(query)
result.coalesce(4).write.partitionBy('dt').csv(self.output_dir().path, mode='append', sep='\t')
Contributor

It might be good to add a comment here, that this will produce four TSV files in the dt= subdirectory under the output_dir().

But a question: is there a reason why the mode is 'append'?

And a second question: I've noticed that write() generally creates a _SUCCESS file separate from the actual output files but in the same directory. Will this use of partitionBy() do the same, and if so, where would it write this file? In each partition, or in the output_dir()?

Contributor Author

It will write to the output_dir(), not the individual partitions.
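
Roughly, the resulting layout is (hypothetical dt values and part-file names):

output_dir()/
    _SUCCESS                <- written once, at the top level
    dt=2018-01-01/
        part-00000-*.csv
    dt=2018-01-02/
        part-00000-*.csv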

Contributor Author

With overwrite mode, it won't keep the other partitions, i.e. partitions that aren't in this interval will be deleted from the output_dir() path. So I'm using append mode and explicitly removing the partitions that will be written by the job, to avoid duplicates.
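
A minimal sketch of that pattern (PySpark; the paths and dataframe here are hypothetical stand-ins for what the real task builds from the event logs and warehouse config):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('last_ip_of_user_sketch').getOrCreate()
df = spark.createDataFrame(
    [('2018-01-01', 1, 'course-v1:edX+DemoX+T1', '2018-01-01T10:00:00', '10.0.0.1')],
    ['dt', 'user_id', 'course_id', 'timestamp', 'ip'],
)
output_dir = 's3://bucket/warehouse/last_ip_of_user_id'  # hypothetical

# In the real task, the dt= partitions about to be rewritten are removed first
# (via the targets returned by output_paths()); the write then appends, so
# partitions outside the current interval are left untouched. mode='overwrite'
# would instead wipe output_dir() entirely.
df.coalesce(4).write.partitionBy('dt').csv(output_dir, mode='append', sep='\t')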

class LastDailyIpAddressOfUserTaskSpark(EventLogSelectionMixinSpark, WarehouseMixin, SparkJobTask):
"""Spark alternate of LastDailyIpAddressOfUserTask"""

output_parent_dir = 'last_ip_of_user_id'
Contributor

nit: how about "output_parent_dirname"? When I first saw this below, I was unclear whether it was a name or a path.

"""
Output directory for spark task
"""
return get_target_from_url(
Contributor

Is there a reason this is a target instead of just a URL? Is that the easiest way to get .exists() functionality?

Contributor Author

Yes, it is for .exists() functionality.
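
Roughly the usage in question, assuming the repo's get_target_from_url/url_path_join helpers (the URL is hypothetical):

from edx.analytics.tasks.util.url import get_target_from_url, url_path_join

partition_url = url_path_join('s3://bucket/warehouse/last_ip_of_user_id', 'dt=2018-01-01/')
target = get_target_from_url(partition_url)
if target.exists():   # a plain URL string would not give us this check
    target.remove()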

for target in self.output_paths():
if target.exists():
target.remove()
super(LastDailyIpAddressOfUserTaskSpark, self).run()
Contributor

In the original, each date in the interval is checked to see if it produces actual data. If the data is sparse enough, then not all date partitions will be created. This was done because the downstream class needed to have all dates exist, even if empty, so as to know which dates had been processed. This was also dealt with by using downstream_input_tasks() to do this check in the LastCountryOfUser task.

But maybe this can be addressed when the follow-on (i.e. non-historic) workflow is written.

Contributor Author

My approach is to handle it in the non-historic task, and if that isn't possible then I'll make adjustments here.
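
If an adjustment here does turn out to be needed, one option is a small helper along these lines (hypothetical sketch; it mirrors the empty-partition behaviour of the original workflow and assumes the repo's URL/target utilities):

from edx.analytics.tasks.util.url import get_target_from_url, url_path_join

def ensure_date_partitions_exist(output_dir_url, dates):
    """Create an empty dt= partition for any date in `dates` that produced no data."""
    for date in dates:  # `dates` is a list of datetime.date objects covering the interval
        partition_url = url_path_join(output_dir_url, 'dt={}'.format(date.isoformat()))
        if not get_target_from_url(partition_url).exists():
            # Touching a single file materializes the otherwise-missing partition directory.
            with get_target_from_url(url_path_join(partition_url, '_EMPTY')).open('w'):
                pass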

2) create marker in separate dir just like hadoop multi-mapreduce task
Former approach can fail in cases, when there is no data for some dates as spark will not generate empty
partitions.
Later approach is more consistent.
Contributor

I see. So this is how you're doing user location. You chose approach 2 over approach 1. That may be fine, though I'm wary about using marker files where we can avoid it, because they are magical and hard to track down. It is really hard to know why a job is not producing output when a marker file exists somewhere with an opaque hash for a name. Also, because the hash comes from hashing the task, which effectively hashes its (significant) parameters, one will get a different hash for each interval. So if I run different but overlapping intervals, it will do an overwrite, but it doesn't actually check that the intervals that have been run are contiguous, or overlapping, or whatever. The markers only check whether the exact same interval has or hasn't been run before.

That's why the location workflow verified partitions for all dates in the interval, and why the run() method created empty partitions for dates that had no data after the multi-mapreduce was done. (That's not just a Spark issue but an issue with multi-mapreduce output in general, whether Spark or map-reduce.)

But perhaps there are other reasons to go with the marker over the partition, like performance. I'm just wary of whether they're worth the complexity for the user.
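
For example, with a marker name derived from a hash of the task's significant parameters (a hypothetical scheme, just to illustrate the point, not the repo's exact implementation):

import hashlib

def marker_name(task_name, **significant_params):
    # Hypothetical: the real marker naming comes from hashing the luigi task itself.
    key = task_name + repr(sorted(significant_params.items()))
    return hashlib.sha1(key.encode('utf-8')).hexdigest()

# Overlapping intervals hash to unrelated markers, so neither run "sees" the other.
marker_name('LastDailyIpAddressOfUserTaskSpark', interval='2018-01-01-2018-02-01')
marker_name('LastDailyIpAddressOfUserTaskSpark', interval='2018-01-15-2018-02-15')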

rao-abdul-mannan (Contributor Author) commented Jun 14, 2018

Spark doesn't create empty partitions when writing output with partitionBy() if there is no data for them. In such cases, checking for partition existence will fail, so I opted for the marker approach.

Contributor

I'll echo Brian's thoughts on using an obscurely named marker file to indicate whether a job has succeeded in the past. We've obviously done it before, and in this specific example we've done it. I'd like to challenge engineers to see if we can come up with a better method, even if all we find out is that our current implementation is the cleanest option. In this case I'm fine with continuing our past behaviors.


def on_success(self): # pragma: no cover
"""Overload the success method to touch the _SUCCESS file. Any class that uses a separate Marker file from the
data file will need to override the base on_success() call to create this marker."""
Contributor

nit: make docstring multiline....
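
i.e. the existing text reflowed as a multiline docstring (the method body is omitted, as in the diff above):

def on_success(self):  # pragma: no cover
    """
    Overload the success method to touch the _SUCCESS file.

    Any class that uses a separate marker file from the data file will need to
    override the base on_success() call to create this marker.
    """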

if self.output_dir().exists(): # only check partitions if parent dir exists
for target in self.output_paths():
if target.exists():
target.remove()
Contributor

It might be good to put in a comment here, because I'm not remembering if we did this because we were writing in append mode, or because of issues if we reduced the number of partitions we coalesced down to on output, or because Spark really doesn't want the files around when it's planning on writing them out. Do you remember?

rao-abdul-mannan (Contributor Author) commented Jun 14, 2018

This is done to avoid duplicates in case of append mode.

azafty468 (Contributor) left a comment

I left a few comments as I read the code. If I remember correctly, the changes in the Spark base PR were getting merged and these were being left unmerged. I could be wrong. The comments I wrote are all forward-looking to future PRs. If we aren't merging this one, no additional work is necessary.


def user_activity_hive_table_path(self, *args):
Contributor

I think as we convert jobs we won't want to reference the word "hive". It definitely makes sense now, while you are interleaving Spark components, but in the future we'll have no Hive components and names like this will be confusing. Again though, no change needed in this PR, just for future reference.

Contributor Author

Right, I'll keep it in mind.


def get_event_time_string(event_time):
"""Returns the time of the event as an ISO8601 formatted string."""
try:
# Get entry, and strip off time zone information. Keep microseconds, if any.
Contributor

Dah, time string processing is a hobgoblin. I suspect all of our event times are converted to UTC. We may want to confirm that and not drop the timezone field.

Contributor Author

It is exactly the same as the one we use for Hadoop.
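
For reference, a sketch of what that comment describes, i.e. stripping the timezone suffix while keeping microseconds (the body below is an assumption; only the signature and comment appear in the diff above):

def get_event_time_string(event_time):
    """Returns the time of the event as an ISO8601 formatted string."""
    try:
        # Get entry, and strip off time zone information. Keep microseconds, if any.
        timestamp = event_time.split('+')[0].split('Z')[0]
        if '.' not in timestamp:
            timestamp = '{}.000000'.format(timestamp)
        return timestamp
    except Exception:  # pylint: disable=broad-except
        return None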
