Why Do We Need Bundles?

Some users, when exposed to the concept of an Oozie bundle for the first time, are a little confused about its usefulness and necessity. Users need and want to run workflows. They also understand the coordinator and its features. But the benefits of an Oozie bundle are not readily apparent. So it might be instructive to go through some concrete use cases and the value of using an Oozie bundle in those example scenarios. Bundles exist primarily for operational convenience, more than anything else.

Let’s look at a typical use case of a rather large Internet company that makes its revenue through advertising and ad clicks. Let’s say that Apache web logs are collected in a low-latency batch and delivered to the backend. The data pipeline then picks up the logs and kicks off a variety of processing on them. The list of applications using this input log data includes, but is not limited to, the following workflows:

  • There is one workflow that counts ad clicks, calculates the cost to the advertiser account IDs, does some basic comparisons against the same time of day the previous week to make sure there are no abnormalities, and publishes a revenue feed. This workflow is called the Revenue WF and runs every 15 minutes.

  • There is a Targeting WF that looks at the user IDs corresponding to the ad clicks and does some processing to segment them for behavioral ad targeting. This workflow also runs every 15 minutes, but it satisfies a completely different business requirement than the Revenue WF and is developed and managed by another team.

  • There is an hourly workflow called the AD-UI WF that rolls up the 15-minute revenue feeds generated by the Revenue WF and pushes a feed to an operational database that feeds an advertiser user interface. This UI is where advertisers and customers log in and track their ad expenditure at an hourly grain.

  • There is a Reporting WF that runs daily in the morning to aggregate a lot of the data from the previous day and generate daily canned reports for the executives of the company.

  • Last but not least, the advertiser billing logic and the SOX (Sarbanes–Oxley) compliance checks run monthly, because that’s when the larger advertisers actually get a bill and are expected to pay. They don’t actually pay daily or hourly. This makes up the Billing WF and involves monthly aggregations and rollups.
Given the varied use cases detailed here, you can see how the entire, consolidated data pipeline can get rather complex. There are several moving parts and interdependencies, though these individual use cases seem to fit nicely into individual Oozie workflows. There will be corresponding coordinator apps that take care of the necessary time and data triggers for these workflows. The same input dataset (weblogs) drives all of the processing, but different groups within the company actually own specific business use cases. Table 8-1 summarizes these workflows and their time frequency and business owners.

[Table 8-1]

In addition to the time frequency, the workflows also have data dependencies among them. For instance, the monthly Billing WF will be dependent on the entire month’s worth of revenue feeds from the AD-UI WF, which itself is dependent on the output of the Revenue WF. These dependencies can be specified via the coordinator app, as we saw in Chapter 6. Bear in mind that there is a one-to-one correspondence between a coordinator app and the workflow it runs. So a coordinator by definition cannot run two workflows of different frequencies as part of one job.
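To make the one-coordinator-per-workflow layout concrete, here is a minimal sketch of what the coordinator app for the Revenue WF might look like. The app name, HDFS paths, and dates here are hypothetical illustrations, not taken from the book's examples:

```xml
<!-- Hypothetical coordinator app for the Revenue WF: runs every 15 minutes
     and waits on the corresponding 15-minute weblog dataset instance. -->
<coordinator-app name="revenue-coord" frequency="${coord:minutes(15)}"
                 start="2014-03-01T00:00Z" end="2014-12-31T23:45Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="weblogs" frequency="${coord:minutes(15)}"
             initial-instance="2014-03-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/weblogs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="weblogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/revenue-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Each of the five workflows would get its own coordinator along these lines, differing only in frequency and input events, since a single coordinator cannot mix frequencies.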

Assuming the layout of coordinator and workflow apps as defined in the previous paragraph, let’s look at some failure scenarios that are common in such a complex data pipeline. Let’s say the operations team finds out at 11 p.m. on March 31 that some of the data for that day is missing. Specifically, there was a network hiccup that caused some silent data loss in the previous four hours starting at 7 p.m. It is finally detected and a high-priority alert is issued. Many operations teams across the organization get into an emergency mode to fix the issue. Once the issue is fixed, the old data that’s missing will be delivered to the data pipeline. But the pipeline is long done with hours 7 through 9 and is minutes away from kicking off the hourlies for the 10 p.m. hour. And we are also pretty close to the dailies kicking off, and the monthly billing is not too far off either, as this is the last day of the month. There is no point kicking off the daily and monthly jobs without completing the reprocessing of the last four hours. The operations team has to stall all those coordinator jobs and reprocess the 15-minute and hourly ones from the last four hours. The coordinator has the right tools and options for suspending, starting, and reprocessing all those jobs, but it’s a lot of manual work for the data pipeline operations team responsible for all these coordinator jobs. As we all know, manual processing is quite error-prone.
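This is exactly the situation a bundle is meant to simplify: a single bundle app can group all of these coordinators so that the operations team can act on the whole pipeline at once instead of juggling each coordinator job individually. The sketch below is a hypothetical illustration; the coordinator names and HDFS paths are assumptions, not from the book's examples:

```xml
<!-- Hypothetical bundle grouping the pipeline's five coordinators so they
     can be suspended, resumed, or rerun as a single unit. -->
<bundle-app name="ad-pipeline-bundle" xmlns="uri:oozie:bundle:0.2">
  <coordinator name="revenue-coord">
    <app-path>hdfs:///apps/coords/revenue</app-path>
  </coordinator>
  <coordinator name="targeting-coord">
    <app-path>hdfs:///apps/coords/targeting</app-path>
  </coordinator>
  <coordinator name="ad-ui-coord">
    <app-path>hdfs:///apps/coords/ad-ui</app-path>
  </coordinator>
  <coordinator name="reporting-coord">
    <app-path>hdfs:///apps/coords/reporting</app-path>
  </coordinator>
  <coordinator name="billing-coord">
    <app-path>hdfs:///apps/coords/billing</app-path>
  </coordinator>
</bundle-app>
```

With such a bundle in place, a single `oozie job -suspend` against the bundle job ID pauses every coordinator in the pipeline at once, and `-resume` restarts them after the reprocessing is done, rather than requiring the same sequence of operations on five separate coordinator jobs.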