<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title>Apache Spark™ Workshop | Spark SQL</title>
<meta name="description" content="Apache Spark™ Workshop | Spark SQL">
<meta name="author" content="Jacek Laskowski">
<link rel="stylesheet" href="reveal4/dist/reset.css">
<link rel="stylesheet" href="reveal4/dist/reveal.css">
<link rel="stylesheet" href="reveal4/dist/theme/beige.css">
<!-- Theme used for syntax highlighting of code -->
<link rel="stylesheet" href="reveal4/plugin/highlight/monokai.css">
<!-- Jacek: custom formatting -->
<link rel="stylesheet" href="revealjs-css/jacek.css">
</head>
<body>
<div class="reveal">
<div class="footer">
<footer style="font-size: small;">
© <a href="https://medium.com/@jaceklaskowski">Jacek Laskowski</a> 2022 / <a
href="https://twitter.com/jaceklaskowski">@JacekLaskowski</a>
</footer>
</div>
<div class="slides">
<section class="intro" data-transition="zoom" id="home">
<p>
<img width="17%" style="background:none; border:none; box-shadow:none;" data-src="images/spark-logo.png">
<img width="10%" src="images/jacek_laskowski_20201229_200x200.png" style="border: 0">
</p>
<h1>Spark SQL</h1>
<h3>Apache Spark 3.2</h3>
<br><br>
<h4 style="font-size: smaller;">
<a href="https://twitter.com/jaceklaskowski">@jaceklaskowski</a> / <a
href="https://stackoverflow.com/users/1305344/jacek-laskowski">StackOverflow</a> / <a
href="https://github.com/jaceklaskowski">GitHub</a> / <a
href="https://www.linkedin.com/in/jaceklaskowski/">LinkedIn</a>
<br>
The "Internals" Books: <a href="https://books.japila.pl">books.japila.pl</a>
</h4>
</section>
<section id="agenda" data-markdown>
<textarea data-template>
## Agenda
1. [Spark SQL](#/intro)
1. [SparkSession](#/sparksession)
1. [DataSource API](#/datasource)
1. [Schema](#/schema)
1. [Ad-hoc Local Datasets](#/ad-hoc-local-datasets)
1. [Complete Spark SQL Application](#/complete-spark-sql-application)
1. [Demo: Creating Spark SQL Application](#/demo)
</textarea>
</section>
<section id="intro">
<h2>Spark SQL</h2>
<ol style="font-size: 80%">
<li>Data Processing with Structured Queries on Massive Scale
<ul>
<li>SQL-like Relational Queries</li>
<li>Distributed computations (RDD API)</li>
</ul>
</li>
<li>High-level "front-end" languages
<ul>
<li>SQL, Scala, Java, Python, R</li>
</ul>
</li>
<li>Low-level "back-end" logical operators
<ul>
<li>Logical and physical query plans</li>
</ul>
</li>
<li><b>Dataset</b> (and <b>DataFrame</b>) data abstractions</li>
<li><b>Encoder</b> for storage and performance optimizations
<ul>
<li>Reducing garbage collection</li>
</ul>
</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/overview/">Spark SQL — Structured Data Processing with Relational Queries on Massive Scale</a></li>
</ul>
</li>
</ol>
</section>
<section id="sparksession">
<h2>SparkSession</h2>
<ol>
<li>The entry point to Spark SQL</li>
<li>Use the <b>SparkSession.builder</b> fluent API to build one <small>(see the sketch below)</small></li>
<li>Loading datasets using <b>load</b> <small>discussed later</small></li>
<li>spark-shell gives you one instance as <b>spark</b></li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/SparkSession/">SparkSession — The Entry Point to Spark SQL</a></li>
</ul>
</li>
</ol>
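<p style="font-size: 80%">A minimal sketch of building a SparkSession (the app name is illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
import org.apache.spark.sql.SparkSession
// Reuses the active session if there is one, builds a new one otherwise
val spark = SparkSession.builder
  .appName("Spark SQL Workshop") // illustrative name
  .getOrCreate()
</code></pre>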
</section>
<section>
<section style="font-size: 85%;" id="datasource">
<h2>DataSource API — Reading and Writing Datasets</h2>
<ol>
<li>Loading datasets using <b>SparkSession.read</b>
<ul>
<li>dataset <span style="font-size: small; vertical-align: super;">(lowercase)</span> = data for processing</li>
</ul>
</li>
<li>Writing Datasets using <b>Dataset.write</b>
<ul>
<li>Dataset <span style="font-size: small; vertical-align: super;">(uppercase)</span> = a distributed computation</li>
</ul>
</li>
<li>Loading and writing operators create source and sink nodes in a data flow graph</li>
<li>Pluggable API</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/DataSource/">DataSource — Pluggable Data Provider Framework</a></li>
</ul>
</li>
</ol>
</section>
<section style="font-size: 90%;">
<h2>Reading/Loading Datasets</h2>
<pre><code data-noescape class="lang-scala hljs">
val dataset = spark.read.format("csv").load("csvs/*")
</code></pre>
<ol>
<li><b>SparkSession.read</b> — DataFrameReader
<ol>
<li><b>format</b></li>
<li><b>option</b> and <b>options</b></li>
<li><b>schema</b></li>
<li><b>load</b></li>
<li>format-specific loading methods <small>discussed on next slide</small></li>
</ol>
</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/DataFrameReader/">DataFrameReader — Reading Datasets from External Data Sources</a></li>
</ul>
</li>
</ol>
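<p style="font-size: 80%">The same load, sketched with options and an explicit DDL-formatted schema (path and columns are illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
val dataset = spark.read
  .format("csv")
  .option("header", true)         // use the first line for column names
  .schema("id LONG, name STRING") // skips schema inference
  .load("csvs/*")
</code></pre>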
</section>
<section>
<h2>Built-In Data Sources (Formats)</h2>
<ol>
<li>File Formats
<ul>
<li><b>csv</b></li>
<li><b>json</b></li>
<li><b>orc</b></li>
<li><b>parquet</b></li>
<li><b>text</b> and <b>textFile</b></li>
</ul>
</li>
<li><b>jdbc</b></li>
<li><b>table</b></li>
</ol>
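<p style="font-size: 80%">Each built-in file format also has a shortcut method on DataFrameReader, so the following two loads are equivalent (the path is illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
val viaFormat = spark.read.format("json").load("data/*.json")
val viaShortcut = spark.read.json("data/*.json")
</code></pre>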
</section>
<section style="font-size: 90%;">
<h2>Writing/Saving Datasets</h2>
<pre><code data-noescape class="lang-scala hljs">
dataset.write.format("json").save("dailies")
</code></pre>
<ol>
<li><b>Dataset.write</b> — DataFrameWriter
<ol>
<li><b>format</b></li>
<li><b>mode</b></li>
<li><b>option</b> and <b>options</b></li>
<li><b>partitionBy</b>, <b>bucketBy</b> and <b>sortBy</b></li>
<li><b>insertInto</b>, <b>save</b> and <b>saveAsTable</b></li>
</ol>
</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/DataFrameWriter/">DataFrameWriter</a></li>
</ul>
</li>
</ol>
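<p style="font-size: 80%">A sketch combining mode and partitionBy (column and path are illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
dataset.write
  .format("parquet")
  .mode("overwrite")   // replace any existing output
  .partitionBy("year") // one subdirectory per distinct value
  .save("dailies")
</code></pre>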
</section>
</section>
<section id="schema">
<h2>Schema</h2>
<ol>
<li>Schema = <b>StructType</b> with one or more <b>StructFields</b></li>
<li>Implicit (inferred) or explicit <small>(see the sketch below)</small></li>
<li><b>dataset.printSchema</b></li>
<li>A schema can be derived from your case class(es)
<pre><code data-noescape class="lang-scala hljs">
case class Person(id: Long, name: String)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
</code></pre>
</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/types/StructType/">StructType</a></li>
</ul>
</li>
</ol>
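<p style="font-size: 80%">An explicit schema can also be built up field by field (the fields are illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType))) // nullable by default
</code></pre>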
</section>
<section id="ad-hoc-local-datasets">
<h2>Ad-hoc Local Datasets</h2>
<ol>
<li><b>Seq(...).toDF("col1", "col2", ...)</b> for local DataFrames</li>
<li><b>Seq(...).toDS</b> for local Datasets</li>
<li>Use <b>import spark.implicits._</b></li>
<li>Almost all Scala collections are supported</li>
<li>Switch to <a href="https://books.japila.pl/spark-sql-internals/">The Internals of Spark SQL</a>
<ul>
<li><a href="https://books.japila.pl/spark-sql-internals/implicits/">implicits Object — Implicit Conversions</a></li>
</ul>
</li>
</ol>
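<p style="font-size: 80%">A minimal sketch (column names and values are illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
import spark.implicits._ // enables toDF and toDS
val df = Seq((0, "zero"), (1, "one")).toDF("id", "name")
val ds = Seq("hello", "world").toDS
</code></pre>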
</section>
<section data-markdown id="complete-spark-sql-application">
<textarea data-template>
## Complete Spark SQL Application
### (From CSV to Parquet)
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, upper}

// no master URL hard-coded!
// Use spark-submit to specify it at submission time
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // enables the $"name" column syntax

spark
  .read
  .option("header", true)
  .option("inferSchema", true)
  .csv("inventory/*.csv")
  .withColumn("name", upper($"name"))
  .groupBy("categoryId").agg(count("id") as "count")
  .write
  .mode("overwrite")
  .save("reports") // parquet is the default output format
```
</textarea>
</section>
<section id="demo">
<h2>Demo: Creating Spark SQL Application</h2>
<ol>
<li>Use <b>IntelliJ IDEA</b> and <b>sbt</b></li>
<li>Define Spark SQL dependency in <b>build.sbt</b> <small>(see the sketch below)</small>
<ul>
<li><b>libraryDependencies</b></li>
</ul>
</li>
<li>Write your <b>Spark SQL code</b>
<ul>
<li><b>spark.version</b></li>
</ul>
</li>
<li>Execute <b>sbt package</b></li>
<li>Run the application using <b>spark-submit</b></li>
</ol>
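<p style="font-size: 80%">A sketch of the build.sbt dependency (versions are illustrative):</p>
<pre><code data-noescape class="lang-scala hljs">
// build.sbt
scalaVersion := "2.13.8"
// "provided" keeps Spark jars out of the package;
// spark-submit supplies them at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided"
</code></pre>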
</section>
<section style="text-align: left" data-markdown id="questions">
<textarea data-template>
# Questions?
* Read [The Internals of Apache Spark](https://books.japila.pl/apache-spark-internals/)
* Read [The Internals of Spark SQL](https://books.japila.pl/spark-sql-internals/)
* Read [The Internals of Spark Structured Streaming](https://books.japila.pl/spark-structured-streaming-internals/)
* Follow [@jaceklaskowski](https://twitter.com/jaceklaskowski) on Twitter
* Upvote [my questions and answers on StackOverflow](http://stackoverflow.com/users/1305344/jacek-laskowski)
</textarea>
</section>
</div>
</div>
<script src="reveal4/dist/reveal.js"></script>
<script src="reveal4/plugin/notes/notes.js"></script>
<script src="reveal4/plugin/markdown/markdown.js"></script>
<script src="reveal4/plugin/highlight/highlight.js"></script>
<script src="reveal4/plugin/zoom/zoom.js"></script>
<script>
// More info about config & dependencies:
// - https://github.com/hakimel/reveal.js#configuration
// - https://github.com/hakimel/reveal.js#dependencies
Reveal.initialize({
hash: true,
pdf: true,
slideNumber: 'c/t',
showSlideNumber: 'speaker',
// Learn about plugins: https://revealjs.com/plugins/
plugins: [RevealMarkdown, RevealHighlight, RevealNotes, RevealZoom]
});
</script>
<script>
(function (i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date(); a = s.createElement(o),
m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-45999426-3', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>