slides/spark-sql-joins.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Spark SQL | Joins</title>

    <meta name="description" content="Apache Spark™ Workshop | Spark SQL | Joins">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">
          &copy; <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2022 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
          / jacek@japila.pl
        </footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="17%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="10%" src="images/jacek_laskowski_20201229_200x200.png" style="border: 0">
          </p>
          <h1>Joins</h1>
          <br />
          <h3>Apache Spark 3.2 / Spark SQL</h3>

          <hr />
          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a
              href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a
              href="https://github.com/jaceklaskowski">GitHub</a> / <a
              href="https://www.linkedin.com/in/jaceklaskowski/">LinkedIn</a>
            <br>
            The "Internals" Books: <a href="https://books.japila.pl/apache-spark-internals">Apache Spark</a> / <a
              href="https://books.japila.pl/spark-sql-internals/">Spark SQL</a> / <a
              href="https://books.japila.pl/delta-lake-internals">Delta Lake</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [join Operators](#/join-dataset)
            1. [Join Condition](#/join-condition)
            1. [Join Types](#/join-types)
            1. [Join Clause (SQL)](#/join-sql)
            1. [Broadcast Join (aka Map-Side Join)](#/broadcast-join)
            1. [Join Internals and Optimizations](#/join-internals)
          </textarea>
        </section>

        <section id="join-dataset" data-markdown>
          <textarea data-template>
            ## join Operators

            1. You can join two Datasets using join operators:
              * **join** for untyped Row-based joins
              * Type-preserving **joinWith**
              * **crossJoin** for explicit cartesian joins
            1. Join conditions in **join** or **filter** / **where** operators
<pre><code class="scala" data-trim style="word-wrap: break-word;">
  // equivalent to left("id") === right("id")
  left.join(right, "id")
  left.join(right, "id", "anti") // explicit join type
  left.join(right).filter(left("id") === right("id"))
  left.join(right).where(left("id") === right("id"))
</code></pre>
            <!-- .element: style="width: 810px;" -->
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [Dataset Join Operators](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-joins.html)
          </textarea>
        </section>

        <section id="join-condition" data-markdown>
          <textarea data-template>
            ## Join Condition

            1. Join conditions in **join** or **filter** / **where** operators
<pre><code class="scala" data-trim style="word-wrap: break-word;">
  // equivalent to left("id") === right("id")
  left.join(right, "id")
  // explicit join type
  left.join(right, "id", "anti")
  left.join(right).filter(left("id") === right("id"))
  left.join(right).where(left("id") === right("id"))
  // expr AND expr in multiple where's
  left.join(right)
    .where(left("id") === right("empid"))
    .where(left("name") === right("empname"))
</code></pre>
              <!-- .element: style="width: 810px;" -->
            1. Uses [Column API](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column)
              ```
                left("id") === right("empid")
                left("id") === right("empid") AND left("name") === right("empname")
              ```
              <!-- .element: style="width: 850px;" -->
          </textarea>
        </section>

        <section id="join-types" data-markdown>
          <textarea data-template>
            ## Join Types

            | SQL         | Names       |
            | ------------|-------------|
            | CROSS       | cross |
            | INNER       | inner |
            | FULL OUTER  | outer, full, fullouter |
            | LEFT ANTI   | leftanti |
            | LEFT OUTER  | leftouter, left |
            | LEFT SEMI   | leftsemi |
            | RIGHT OUTER | rightouter, right |
            <!-- .element: style="font-size: 0.8em;" -->
          </textarea>
        </section>

        <section id="join-sql" data-markdown>
          <textarea data-template>
            ## Join Clause (SQL)

            1. You can join two tables using regular SQL join operators
            1. Use **SparkSession.sql** to execute SQL
              ```scala
              spark.sql("select * from t1, t2 where t1.id = t2.id")
              ```
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [Executing SQL Queries (aka SQL Mode) &mdash; sql Method](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkSession.html#sql)
          </textarea>
        </section>

        <section id="broadcast-join" data-markdown>
          <textarea data-template>
            <!-- .slide: style="font-size: 90%" -->

            ## Broadcast Join (aka Map-Side Join)

            1. Use **broadcast** function to mark a Dataset to be broadcast
              ```scala
              left.join(broadcast(right), "token")
              ```
            1. Use **broadcast hints** in SQL
              ```
              SELECT /*+ BROADCAST (t1) */ * FROM t1, t2 WHERE t1.id = t2.id
              ```
              * **BROADCAST**, **BROADCASTJOIN** or **MAPJOIN** hints supported
            1. Switch to [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
              * [broadcast Function](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-functions.html#broadcast)
              * [withHints Parsing Handler](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-AstBuilder.html#withHints)
          </textarea>
        </section>

        <section>
          <section id="join-internals">
            <h1>Join</h1>
            <h1 style="font-size: 2.77em;">Internals and Optimizations</h1>
          </section>
          <section id="join-logical-operator">
            <h2>Join Logical Operator</h2>
            <ol>
              <li><b>Join</b> is a binary logical operator with two logical operators, a join type and an optional join expression</li>
              <li>Switch to <a href="https://bit.ly/spark-sql-internals">The Internals of Spark SQL</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-LogicalPlan-Join.html">Join Logical Operator</a></li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="join-selection" style="font-size: 80%;">
            <h2>JoinSelection Execution Planning Strategy</h2>
            <ol>
              <li><b>JoinSelection</b> is an execution planning strategy that SparkPlanner uses to plan Join logical operators</li>
              <li>Supported joins (and their physical operators)
                <ol>
                  <li><b>Broadcast Hash Join</b> (BroadcastHashJoinExec)</li>
                  <li><b>Shuffled Hash Join</b> (ShuffledHashJoinExec)</li>
                  <li><b>Sort Merge Join</b> (SortMergeJoinExec)</li>
                  <li><b>Broadcast Nested Loop Join</b> (BroadcastNestedLoopJoinExec)</li>
                  <li><b>Cartesian Join</b> (CartesianProductExec)</li>
                </ol>
              </li>
              <li>Switch to <a href="https://bit.ly/spark-sql-internals">The Internals of Spark SQL</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkStrategy-JoinSelection.html">JoinSelection Execution Planning Strategy</a></li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="join-optimization-bucketing">
            <h2>Join Optimization &mdash; Bucketing</h2>
            <ol>
              <li><b>Bucketing</b> is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning.</li>
              <li>Optimize performance of a join query by reducing shuffles (aka exchanges)</li>
              <li>Switch to <a href="https://bit.ly/spark-sql-internals">The Internals of Spark SQL</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html">Bucketing</a></li>
                </ul>
              </li>
              <li>My talk <a href="https://youtu.be/dv7IIYuQOXI">Bucketing in Spark SQL 2 3</a> on Spark+AI Summit 2018</li>
            </ol>
          </section>
          <section id="join-optimization-join-reorg">
            <h2>Join Optimization &mdash; Join Reordering</h2>
            <ol>
              <li><b>Join Reordering</b> is an optimization of a logical query plan that the Spark Optimizer uses for joins</li>
              <li>Switch to <a href="https://bit.ly/spark-sql-internals">The Internals of Spark SQL</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-ReorderJoin.html">ReorderJoin Logical Query Optimization</a></li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="join-optimization-join-reorg-cbo" style="font-size: 80%;">
            <h2>Join Optimization &mdash; Cost-based Join Reordering</h2>
            <ol>
              <li><b>Cost-based Join Reordering</b> is an optimization of a logical query plan that the Spark Optimizer uses for joins in cost-based optimization</li>
              <li>Switch to <a href="https://bit.ly/spark-sql-internals">The Internals of Spark SQL</a>
                <ul>
                  <li><a href="https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-CostBasedJoinReorder.html">CostBasedJoinReorder Logical Query Optimization</a></li>
                </ul>
              </li>
            </ol>
          </section>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [join Operators](#/join-dataset)
            1. [Join Condition](#/join-condition)
            1. [Join Types](#/join-types)
            1. [Join Clause (SQL)](#/join-sql)
            1. [Broadcast Join (aka Map-Side Join)](#/broadcast-join)
            1. [Join Internals and Optimizations](#/join-internals)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
            * Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
            * Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>