slides/spark-sql-multi-dimensional-aggregation.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Spark SQL | Multi-Dimensional Aggregation</title>

    <meta name="description" content="Apache Spark™ Workshop | Spark SQL | Multi-Dimensional Aggregation">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">
          &copy; <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2019 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
          / jacek@japila.pl
        </footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="12%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="6%" src="images/jacek_laskowski_20141201_512px.png" style="border: 0">
          </p>
          <h1 style="font-size: 3.07em;">Multi-Dimensional Aggregation</h1>
          <h3>Apache Spark 2.4.4 / Spark SQL</h3>

          <hr />
          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a href="https://github.com/jaceklaskowski">GitHub</a>
            <br>
            The "Internals" Books: <a href="https://bit.ly/apache-spark-internals">Apache Spark</a> / <a href="https://bit.ly/spark-sql-internals">Spark SQL</a> / <a href="https://bit.ly/spark-structured-streaming">Spark Structured Streaming</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [Multi-dimensional rollup Operator](#/rollup)
            1. [Multi-dimensional cube Operator](#/cube)
            1. [GROUPING SETS SQL clause](#/grouping-sets)
            1. [grouping Aggregate Function](#/grouping)
            1. [grouping_id Aggregate Function](#/grouping-id)
          </textarea>
        </section>

        <section id="rollup" data-markdown style="font-size: 85%">
          <textarea data-template>
            ## Multi-dimensional rollup Operator

            1. **rollup** calculates subtotals and totals over (ordered) combination of groups
              ```scala
              val inventory =
                Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))
                  .toDF("name", "year", "amount")
              inventory.rollup("name", "year").sum("amount")
              ```
              * **Quiz**: How many records in result set?
            1. Advanced variant of **groupBy** with higher efficiency
            1. Creates **RelationalGroupedDataset**
              * Supports untyped, Row-based **agg**
              * Shortcuts for _the usual suspects_, e.g. **avg**, **count**, **pivot**
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [rollup Aggregation Operator](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#rollup)
          </textarea>
        </section>

        <section>
          <section id="cube" data-markdown style="font-size: 90%">
            <textarea data-template>
              ## Multi-dimensional cube Operator

              1. **cube** is similar to **rollup** but calculates subtotals and totals over **all combinations** of groups
                ```scala
                val inventory =
                  Seq(("t1", 2015, 100), ("t1", 2016, 50), ("t2", 2016, 40))
                    .toDF("name", "year", "amount")
                inventory.cube("name", "year").sum("amount") // note cube (not rollup)
                ```
                * **Quiz**: How many records in result set?
              1. Advanced variant of **groupBy** with higher efficiency
              1. Creates **RelationalGroupedDataset**
              1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
                * [cube Aggregation Operator](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#cube)
            </textarea>
          </section>
        </section>

        <section id="grouping-sets" data-markdown style="font-size: 90%">
          <textarea data-template>
            ## GROUPING SETS SQL clause

            1. Spark SQL supports **GROUPING SETS** in SQL only
            1. Used in GROUP BY allows to specify more than one GROUP BY option in the same record set
              * Equivalent to several GROUP BYs connected by UNION
            ```scala
Seq(("a1", "b1", 3), ("a1", "b2", 7), ("a2", "b2", 5))
  .toDF("a", "b", "c")
  .createOrReplaceTempView("t1")
val q = sql(
  "SELECT a,b, SUM(c) FROM t1 GROUP BY a, b GROUPING SETS (a,b)"
)
            ```
            **Quiz**: How many records in result set?
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [GROUPING SETS SQL Clause](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-multi-dimensional-aggregation.html#grouping-sets)
          </textarea>
        </section>

        <section id="grouping" data-markdown>
          <textarea data-template>
            ## grouping Aggregate Function

            1. **grouping** - aggregate function that indicates whether a specified column in a GROUP BY list is aggregated or not
              ```scala
workshops
  .cube("city", "year")
  .agg(grouping("city"), grouping("year"))
  .sort($"city".desc_nulls_last, $"year".desc_nulls_last)
              ```
              * returns 1 for aggregated or 0 for not aggregated
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [grouping Aggregate Function](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-functions.html#grouping)
          </textarea>
        </section>

        <section id="grouping-id" data-markdown>
          <textarea data-template>
            ## grouping_id Aggregate Function

            1. **Exercise** (hard): Can you guess yourself?
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [grouping_id Aggregate Function](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-functions.html#grouping_id)
          </textarea>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [Multi-dimensional rollup Operator](#/rollup)
            1. [Multi-dimensional cube Operator](#/cube)
            1. [GROUPING SETS SQL clause](#/grouping-sets)
            1. [grouping Aggregate Function](#/grouping)
            1. [grouping_id Aggregate Function](#/grouping-id)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
            * Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
            * Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>