[SPARKNLP-1092] Adding support to read HTML files (#14449)
* [SPARKNLP-1089] Adding support to read HTML files

* [SPARKNLP-1089] Adding documentation and support for set of URLs in python

* [SPARKNLP-1089] Adding input validation in python

* [SPARKNLP-1089] Minor fix to notebook
danilojsl authored Dec 9, 2024
1 parent 180a3de commit 2482436
Showing 17 changed files with 15,675 additions and 6 deletions.
3 changes: 2 additions & 1 deletion build.sbt
@@ -156,7 +156,8 @@ lazy val utilDependencies = Seq(
     exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
   greex,
   azureIdentity,
-  azureStorage)
+  azureStorage,
+  jsoup)
 
 lazy val typedDependencyParserDependencies = Seq(junit)

296 changes: 296 additions & 0 deletions examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb
@@ -0,0 +1,296 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tzcU5p2gdak9"
},
"source": [
"# Introducing the HTML Reader in Spark NLP\n",
"This notebook showcases the newly added `sparknlp.read().html()` method in Spark NLP, which parses HTML content from both local files and real-time URLs into a Spark DataFrame.\n",
"\n",
"**Key Features:**\n",
"- Ability to parse HTML from local directories and URLs.\n",
"- Versatile support for varied data ingestion scenarios."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RFOFhaEedalB"
},
"source": [
"## Setup and Initialization\n",
"Let's keep in mind a few things before we start 😊\n",
"\n",
"Support for reading HTML files was introduced in `Spark NLP 5.5.2`. Please make sure you have upgraded to the latest Spark NLP release."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Let's install and set up Spark NLP in Google Colab\n",
"- This part is pretty easy via our simple script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the local-files example, we will download a couple of HTML files from the Spark NLP GitHub repo:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ya8qZe00dalC",
"outputId": "4399cc35-31d4-459c-bee8-d7eeba3d40cd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-11-05 20:02:19-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/example-10k.html\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2456707 (2.3M) [text/plain]\n",
"Saving to: ‘html-files/example-10k.html’\n",
"\n",
"\r",
"example-10k.html 0%[ ] 0 --.-KB/s \r",
"example-10k.html 100%[===================>] 2.34M --.-KB/s in 0.01s \n",
"\n",
"2024-11-05 20:02:19 (157 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]\n",
"\n",
"--2024-11-05 20:02:20-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/fake-html.html\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 665 [text/plain]\n",
"Saving to: ‘html-files/fake-html.html’\n",
"\n",
"fake-html.html 100%[===================>] 665 --.-KB/s in 0s \n",
"\n",
"2024-11-05 20:02:20 (41.9 MB/s) - ‘html-files/fake-html.html’ saved [665/665]\n",
"\n"
]
}
],
"source": [
"!mkdir html-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-10k.html -P html-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/fake-html.html -P html-files"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EoFI66NAdalE"
},
"source": [
"## Parsing HTML from Local Files\n",
"Use the `html()` method to parse HTML content from local directories."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bAkMjJ1vdalE",
"outputId": "c4bb38d4-963d-465b-e222-604dc6b617aa"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+--------------------+--------------------+\n",
"| path| content| html|\n",
"+--------------------+--------------------+--------------------+\n",
"|file:/content/htm...|<!DOCTYPE html>\\n...|[{Title, 0, My Fi...|\n",
"|file:/content/htm...|<?xml version=\"1...|[{Title, 0, UNITE...|\n",
"+--------------------+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"import sparknlp\n",
"html_df = sparknlp.read().html(\"./html-files\")\n",
"\n",
"html_df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VQD2k4E5dalF"
},
"source": [
"## Parsing HTML from Real-Time URLs\n",
"Use the `html()` method to fetch and parse HTML content from a URL or a set of URLs in real time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MMTGmxLQdalG",
"outputId": "57e99213-0fc7-483c-b7c2-695552fc8d73"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+\n",
"| html|\n",
"+--------------------+\n",
"|[{Title, 0, Examp...|\n",
"+--------------------+\n",
"\n"
]
}
],
"source": [
"html_df = sparknlp.read().html(\"https://example.com/\")\n",
"html_df.select(\"html\").show()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-psYdzWodalG",
"outputId": "544cd7e3-93a6-465a-8b9a-52d487d63b21"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+--------------------+\n",
"| url| html|\n",
"+--------------------+--------------------+\n",
"|https://www.wikip...|[{Title, 0, Wikip...|\n",
"|https://example.com/|[{Title, 0, Examp...|\n",
"+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"htmls_df = sparknlp.read().html([\"https://www.wikipedia.org\", \"https://example.com/\"])\n",
"htmls_df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FrVKxdySz8pR"
},
"source": [
"### Configuration Parameters"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QOXXVx5e7Ri1"
},
"source": [
"You can customize the font size used to identify paragraphs that should be treated as titles. By default, it is set to 16. If your HTML files require a different value, you can adjust this parameter accordingly. The example below demonstrates how to modify this setting:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aNfN0fQC0Vzz",
"outputId": "0b849a86-2d59-4415-981a-dcd9a9f7a14a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|html |\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|[{Title, 0, My First Heading, {pageNumber -> 1}}, {Title, 0, My Second Heading, {pageNumber -> 1}}, {NarrativeText, 0, My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves?, {pageNumber -> 1}}, {Title, 0, A Third Heading, {pageNumber -> 1}}, {Table, 0, Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2, {pageNumber -> 1}}]|\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"\n"
]
}
],
"source": [
"params = {\"titleFontSize\": \"12\"}\n",
"html_df = sparknlp.read(params).html(\"./html-files/fake-html.html\")\n",
"html_df.select(\"html\").show(truncate=False)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
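
In the notebook outputs above, the `html` column holds an array of element structs (element type, text content, and metadata such as pageNumber). A minimal post-processing sketch, assuming the DataFrame layout shown in those outputs, that flattens the array into one row per element with PySpark's explode:

import sparknlp
from pyspark.sql import functions as F

html_df = sparknlp.read().html("./html-files")

# Each row's `html` value is an array of parsed elements; explode yields one element per row.
elements_df = html_df.select("path", F.explode("html").alias("element"))
elements_df.show(truncate=80)

This keeps one parsed element (Title, NarrativeText, Table, and so on) per row, which is usually easier to feed into downstream Spark stages.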
3 changes: 3 additions & 0 deletions project/Dependencies.scala
@@ -134,5 +134,8 @@ object Dependencies {
   val llamaCppSilicon = "com.johnsnowlabs.nlp" %% "jsl-llamacpp-silicon" % llamaCppVersion
   val llamaCppAarch64 = "com.johnsnowlabs.nlp" %% "jsl-llamacpp-aarch64" % llamaCppVersion
 
+  val jsoupVersion = "1.18.1"
+  val jsoup = "org.jsoup" % "jsoup" % jsoupVersion
+
   /** ------- Dependencies end ------- */
 }
14 changes: 10 additions & 4 deletions python/sparknlp/__init__.py
@@ -12,17 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import sys
 import subprocess
+import sys
 import threading
 
+from pyspark.conf import SparkConf
+from pyspark.context import SparkContext
+from pyspark.java_gateway import launch_gateway
+from pyspark.sql import SparkSession
+
 from sparknlp import annotator
 # Must be declared here one by one or else PretrainedPipeline will fail with AttributeError
 from sparknlp.base import DocumentAssembler, MultiDocumentAssembler, Finisher, EmbeddingsFinisher, TokenAssembler, \
     Doc2Chunk, AudioAssembler, GraphFinisher, ImageAssembler, TableAssembler
-from pyspark.conf import SparkConf
-from pyspark.context import SparkContext
-from pyspark.java_gateway import launch_gateway
+from sparknlp.reader import SparkNLPReader
 
 sys.modules['com.johnsnowlabs.nlp.annotators'] = annotator
 sys.modules['com.johnsnsowlabs.nlp.annotators.tokenizer'] = annotator
@@ -301,6 +304,9 @@ def shutdown(self):
         spark_session = start_without_realtime_output()
     return spark_session
 
+def read(params=None):
+    spark_session = start()
+    return SparkNLPReader(spark_session, params)
 
 def version():
     """Returns the current Spark NLP version.
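
The new module-level read() helper starts (or reuses) a Spark session and wraps it in a SparkNLPReader. A minimal usage sketch, reusing the html-files directory and the titleFontSize parameter from the notebook above:

import sparknlp

# The optional params dict is forwarded to SparkNLPReader.
html_df = sparknlp.read({"titleFontSize": "12"}).html("./html-files")
html_df.select("html").show(truncate=False)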
15 changes: 15 additions & 0 deletions python/sparknlp/reader/__init__.py
@@ -0,0 +1,15 @@
# Copyright 2017-2022 John Snow Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module for reading different file types."""
from sparknlp.reader.sparknlp_reader import *
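
Since the module re-exports SparkNLPReader, the reader can also be instantiated directly against an existing session. A sketch under the assumption that the constructor matches the SparkNLPReader(spark_session, params) call used by read() above:

import sparknlp
from sparknlp.reader import SparkNLPReader

spark = sparknlp.start()

# Equivalent to sparknlp.read(params), but with an explicitly managed session.
reader = SparkNLPReader(spark, {"titleFontSize": "16"})
urls_df = reader.html(["https://www.wikipedia.org", "https://example.com/"])
urls_df.show()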