slides/spark-sql-Working-with-Missing-Data.html

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

    <title>Apache Spark™ Workshop | Spark SQL | Working with Missing Data</title>

    <meta name="description" content="Apache Spark™ Workshop | Spark SQL | Working with Missing Data">
    <meta name="author" content="Jacek Laskowski">

    <link rel="stylesheet" href="reveal.js/css/reveal.css">
    <link rel="stylesheet" href="reveal.js/css/theme/beige.css">

    <!-- Theme used for syntax highlighting of code -->
    <link rel="stylesheet" href="reveal.js/lib/css/zenburn.css">

    <!-- Jacek: custom formatting -->
    <link rel="stylesheet" href="revealjs-css/jacek.css">

    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement('link');
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match(/print-pdf/gi) ? 'reveal.js/css/print/pdf.css' : 'reveal.js/css/print/paper.css';
      document.getElementsByTagName('head')[0].appendChild(link);
    </script>
  </head>

  <body>
    <div class="reveal">

      <div class="footer">
        <footer style="font-size: small;">
          &copy; <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2019 / <a href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
          / jacek@japila.pl
        </footer>
      </div>

      <div class="slides">

        <section class="intro" data-transition="zoom" id="home">
          <p>
            <img width="12%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
            <img width="6%" src="images/jacek_laskowski_20141201_512px.png" style="border: 0">
          </p>
          <h1 style="font-size: 3.27em;">Working with <br> Missing Data</h1>
          <h3>Apache Spark 2.4.1 / Spark SQL</h3>

          <h4 style="font-size: smaller;">
            <a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a href="https://github.com/jaceklaskowski">GitHub</a>
            <br>
            The "Internals" Books: <a href="https://bit.ly/apache-spark-internals">Apache Spark</a> / <a href="https://bit.ly/spark-sql-internals">Spark SQL</a> / <a href="https://bit.ly/spark-structured-streaming">Spark Structured Streaming</a>
          </h4>
        </section>

        <section id="agenda" data-markdown>
          <textarea data-template>
            ## Agenda

            1. [What's Missing Data](#/introduction)
            1. [Dataset API for Missing Data](#/dataset-api-for-missing-data)
            1. [Functions for Missing Data](#/functions-for-missing-data)
            1. [Column API for Missing Data](#/column-api-for-missing-data)
            1. [Window Aggregation and Missing Data](#/window-aggregation-and-missing-data)
            1. [Schema Nullability](#/schema-nullability)
            1. [Optimizations](#/optimizations)
            1. [Defining Dataset With Missing Data](#/defining-dataset-with-missing-data)
          </textarea>
        </section>

        <section id="introduction" data-markdown>
          <textarea data-template>
            ## What's Missing Data

            1. **Missing Data** are **null** or **NaN** values in data
              * **Empty data** or **non-existent data**
            1. Can be a result of **LEFT OUTER** or **FULL OUTER** joins
            <pre style="margin-left: 0px;"><code style="width: 900px;" class="lang-text hljs">scala> left.join(right, Seq("id"), "outer").show
            +---+----+-----+
            | id|name| name|
            +---+----+-----+
            |  1| one|jeden|
            |  0|zero| null|
            +---+----+-----+
            </code></pre>
            1. Spark can optimize queries with missing data
          </textarea>
        </section>

        <section>
          <section id="dataset-api-for-missing-data">
            <h1>Dataset API for Missing Data</h1>
          </section>
          <section id="dataframenafunctions">
            <h2>DataFrameNaFunctions</h2>
            <ol>
              <li><a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions">DataFrameNaFunctions</a> is the interface to work with missing data in DataFrames</li>
              <li>Allows for dropping or replacing missing data</li>
              <li>Use <b>Dataset.na</b> operator to access the API</li>
            </ol>
            <pre><code class="hljs" style="word-wrap: break-word;">
              na: DataFrameNaFunctions
            </code></pre>
          </section>
          <section id="dataframenafunctions-api">
            <h2>DataFrameNaFunctions API</h2>
            <ol>
              <li><b>Untyped transformations</b> that return a DataFrame</li>
              <li><b>drop</b> drops rows containing any missing values</li>
              <li><b>fill</b> replaces missing values with a value</li>
              <li><b>replace</b> replaces values matching keys in a replacement map</li>
            </ol>
          </section>
        </section>

        <section>
          <section id="functions-for-missing-data">
            <h1>Functions for Missing Data</h1>
          </section>
          <section id="standard-and-sql-functions" style="font-size: 85%;">
            <h2>Standard and SQL Functions</h2>
            <ol>
              <li>Available functions differ per "execution mode"
                <ul>
                  <li>Dataset API and SQL</li>
                </ul>
              </li>
              <li>Common idiom is to use <b>expr</b> standard function to use SQL-only function outside SQL mode
              <pre><code class="lang-scala" data-trim style="word-wrap: break-word;">
                people.withColumn("expr1", expr("nullif(expr1, expr2)"))
              </code></pre>
              </li>
              <li>Review <a href="https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala">nullExpressions.scala</a> for the definitive list of the Catalyst expressions
                <ul>
                  <li><b>AtLeastNNonNulls</b> (<b>Dataset.drop</b> operator)</li>
                  <li><b>Coalesce</b></li>
                  <li><b>IfNull</b>, <b>NullIf</b></li>
                  <li><b>IsNaN</b>, <b>IsNull</b>, <b>IsNotNull</b></li>
                  <li><b>NaNvl</b>, <b>Nvl</b>, <b>Nvl2</b></li>
                </ul>
              </li>
            </ol>
          </section>
          <section id="standard-functions">
            <h2>Standard Functions</h2>
            <ol>
              <li><b>coalesce</b> selects the first column that is not null, or null if all are null</li>
              <li><b>isnan</b> returns true iff the column is NaN</li>
              <li><b>isnull</b> returns true iff the column is null</li>
              <li><b>nanvl</b> returns col1 if not a NaN, or col2</li>
            </ol>
          </section>
          <section id="standard-aggregate-functions">
            <h2>Standard Aggregate Functions</h2>
            <ol>
              <li><b>first</b> returns the first non-null value when <b>ignoreNulls</b> flag on</li>
              <li><b>last</b></li>
            </ol>
          </section>
          <section id="sql-functions" style="font-size: 90%;">
            <h2>SQL Functions</h2>
            <ol>
              <li><b>ifnull</b> returns expr2 if expr1 is null, or expr1 otherwise</li>
              <li><b>isnan</b> returns true if expr is NaN, or false otherwise</li>
              <li><b>isnotnull</b> returns true if the current expression is NOT null</li>
              <li><b>isnull</b> returns true iff the column is null</li>
              <li><b>nanvl</b> returns col1 if not a NaN, or col2</li>
              <li><b>nullif</b> returns null if expr1 equals to expr2, or expr1 otherwise</li>
              <li><b>nvl</b> returns expr2 if expr1 is null, or expr1</li>
              <li><b>nvl2</b> returns expr2 if expr1 is not null, or expr3</li>
            </ol>
          </section>
        </section>

        <section id="column-api-for-missing-data" style="font-size: 85%;">
          <h2>Column API for Missing Data</h2>
          <ol>
            <li>Filtering (e.g. <b>Dataset.where</b> operator)
              <ul>
                <li><b>isNaN</b> returns true if the current expression is NaN</li>
                <li><b>isNotNull</b> returns true if the current expression is NOT null</li>
                <li><b>isNull</b> returns true if the current expression is null</li>
              </ul>
            </li>
            <li>Sorting (e.g. <b>Dataset.sort</b> operator)
              <ul>
                <li><b>desc_nulls_first</b> returns a descending sort with null values before non-null values (i.e. nulls first)</li>
                <li><b>desc_nulls_last</b> returns a descending sort with null values after non-null values (i.e. nulls last)</li>
              </ul>
              <pre><code class="lang-scala" data-trim style="word-wrap: break-word;">
                people.sort($"age".desc_nulls_last)
              </code></pre>
            </li>
          </ol>
        </section>

        <section>
          <section id="window-aggregation-and-missing-data">
            <h1>Window Aggregation and Missing Data</h1>
          </section>
          <section>
            <h2>Sorting in Window Aggregation</h2>
            <ol>
              <li>Window specification's <b>ORDER BY</b> and nulls</li>
              <li>Ranking functions</li>
            </ol>
          </section>
        </section>

        <section id="schema-nullability">
          <h2>Schema Nullability</h2>
          <ol>
            <li><b>nullable</b> attribute in schema</li>
            <li>Helps query optimizer to handle such columns</li>
            <li>Not enforced and acts as a hint</li>
            <li>If used incorrectly (nulls used for a non-null column), can lead to exceptions difficult to debug</li>
          </ol>
        </section>

        <section id="optimizations">
          <h2>Optimizations</h2>
          <ol>
            <li><b>BooleanSimplification</b></li>
            <li><b>NullPropagation</b></li>
            <li><b>spark.sql.constraintPropagation.enabled</b> (internal) configuration property for Constraint propagation</li>
          </ol>
        </section>

        <section>
          <section id="defining-dataset-with-missing-data">
            <h1>Defining Dataset With Missing Data</h1>
          </section>
          <section>
            <h2>Common but not idiomatic way</h2>
            <pre style="margin-left: 0px;"><code style="width: 900px;" class="lang-text hljs">val names = Seq(
  (0, null.asInstanceOf[String]), // <-- define a missing name
  (1, "hello")).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+</code></pre>
          </section>
          <section>
            <h2>Idiomatic way</h2>
            <pre style="margin-left: 0px;"><code style="width: 900px;" class="lang-text hljs">val names = Seq(
  (0, None), // <-- define a missing name
  (1, Some("hello"))).toDF("id", "name")
scala> names.show
+---+-----+
| id| name|
+---+-----+
|  0| null|
|  1|hello|
+---+-----+</code></pre>
          </section>
        </section>

        <section id="recap" data-markdown>
          <textarea data-template>
            ## Recap

            1. [What's Missing Data](#/introduction)
            1. [Dataset API for Missing Data](#/dataset-api-for-missing-data)
            1. [Functions for Missing Data](#/functions-for-missing-data)
            1. [Column API for Missing Data](#/column-api-for-missing-data)
            1. [Window Aggregation and Missing Data](#/window-aggregation-and-missing-data)
            1. [Schema Nullability](#/schema-nullability)
            1. [Optimizations](#/optimizations)
            1. [Defining Dataset With Missing Data](#/defining-dataset-with-missing-data)
          </textarea>
        </section>

        <section style="text-align: left" data-markdown id="questions">
          <textarea data-template>
            # Questions?

            * Read [The Internals of Apache Spark](https://bit.ly/apache-spark-internals)
            * Read [The Internals of Spark SQL](https://bit.ly/spark-sql-internals)
            * Read [The Internals of Spark Structured Streaming](https://bit.ly/spark-structured-streaming)
            * Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on twitter
            * Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
          </textarea>
        </section>

      </div>
    </div>

    <script src="reveal.js/lib/js/head.min.js"></script>
    <script src="reveal.js/js/reveal.js"></script>

    <script>
      // More info about config & dependencies:
      // - https://github.com/hakimel/reveal.js#configuration
      // - https://github.com/hakimel/reveal.js#dependencies
      Reveal.initialize({
        controls: true,
        progress: true,
        history: true,
        center: true,
        slideNumber: true,

        transition: 'slide', // none/fade/slide/convex/concave/zoom

        menu: {
          markers: true,
          openSlideNumber: true
        },
        dependencies: [
          { src: 'reveal.js/lib/js/classList.js', condition: function () { return !document.body.classList; } },
          { src: 'reveal.js/plugin/markdown/marked.js' },
          { src: 'reveal.js/plugin/markdown/markdown.js' },
          { src: 'reveal.js/plugin/zoom-js/zoom.js', async: true },
          { src: 'reveal.js/plugin/notes/notes.js', async: true },
          { src: 'reveal.js/plugin/highlight/highlight.js', async: true, callback: function () { hljs.initHighlightingOnLoad(); } }
        ]
      });
    </script>
    <script>
      (function (i, s, o, g, r, a, m) {
        i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
          (i[r].q = i[r].q || []).push(arguments)
        }, i[r].l = 1 * new Date(); a = s.createElement(o),
          m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
      })(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

      ga('create', 'UA-45999426-3', 'auto');
      ga('send', 'pageview');

    </script>
  </body>
</html>