copy edits #3

Open · wants to merge 1 commit into master
chapters/chapter-3.md: 6 changes (3 additions, 3 deletions)
@@ -4,9 +4,9 @@ Let's take some time now to explore what it good dynamic analytics systems. We'r

## The fundamental problem

-I'll assert that to build good analytics applications, the foundation trait that your design needs it to *pull apart* the different pieces of the computation. Modern platforms frequently give you one level of abstraction to work with: collections processing. It's certainly alluring abstraction. What could be more comfortable than working with something that *feels* like function composition, but can be transparently applied to petabytes worth of data?
+I'll assert that to build good analytics applications, the foundational trait that your design needs is to *pull apart* the different pieces of the computation. Modern platforms frequently give you one level of abstraction to work with: collections processing. It's certainly an alluring abstraction. What could be more comfortable than working with something that *feels* like function composition, but can be transparently applied to petabytes' worth of data?

-Collections processing (Spark), pipe assembly (Cascading), and SQL descriptions (Hive) all suffer from the same trait - they conflate the computational structure, execution context, resource lifecycles - to name but attributes. Exposing an API of *only* code is poison to the developer who's trying to make truely dynamic systems. Dynamic systems, by their very nature, know little-to-no information about their programmatic execution at compile time. Such systems must communicate in a machine-to-machine fashion, assembling their computations at runtime. These communication lines need to cross programming languages, networks, and time. A code-only API doesn't help us much.
+Collections processing (Spark), pipe assembly (Cascading), and SQL descriptions (Hive) all suffer from the same trait - they conflate the computational structure, execution context, and resource lifecycles - to name but a few attributes. Exposing an API of *only* code is poison to the developer who's trying to make truly dynamic systems. Dynamic systems, by their very nature, know little-to-no information about their programmatic execution at compile time. Such systems must communicate in a machine-to-machine fashion, assembling their computations at runtime. These communication lines need to cross programming languages, networks, and time. A code-only API doesn't help us much.

Let's be a bit more concrete:

@@ -43,7 +43,7 @@ user=> h

That's not terribly helpful. To be clear, I'm *not* criticizing Clojure here. The point that I'm making is that composition of code obscures the structure of what lies within the computation. Make no mistake: **an inability to understand the structure of your distributed processing program at runtime is deadly for diagnosing problems**.
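As a minimal sketch of the kind of session the paragraph above describes (these definitions are illustrative, not the chapter's actual example):

```clojure
;; Illustrative definitions only - not the chapter's actual example.
(def f (partial map inc))
(def g (partial filter even?))
(def h (comp g f))

;; Asking the REPL what `h` is prints an opaque function object,
;; something like #object[clojure.core$comp$fn__...].
;; The two-step structure of the pipeline is gone.
h
```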

-How can we work around this problem? As Clojurists, we know the that if we're not working with code, then we're working with data. Strong analytics systems are aggressively data drive. Data interfaces make your designs simpler, and that increases business opportunity. The technique that we need to employ is similar to how Clojure treats side effects. We don't pretend that code is absent from our programs. Rather, we *minimize* the use of code and describe our programs as data as the norm.
+How can we work around this problem? As Clojurists, we know that if we're not working with code, then we're working with data. Strong analytics systems are aggressively data driven. Data interfaces make your designs simpler, and that increases business opportunity. The technique that we need to employ is similar to how Clojure treats side effects. We don't pretend that code is absent from our programs. Rather, we *minimize* the use of code and express our programs as data.

There are lots of great papers and talks out there on why data driven systems are often superior - I'd recommend searching around. I'm going to spend the rest of this chapter calling out the advantages that are gained specifically for distributed data processing systems.
