[SPARKNLP-1092] Adding support to read HTML files (#14449)
* [SPARKNLP-1089] Adding support to read HTML files

* [SPARKNLP-1089] Adding documentation and support for set of URLs in python

* [SPARKNLP-1089] Adding input validation in python

* [SPARKNLP-1089] Minor fix to notebook
danilojsl authored Dec 9, 2024
1 parent 180a3de commit 2482436
Showing 17 changed files with 15,675 additions and 6 deletions.
3 changes: 2 additions & 1 deletion build.sbt
@@ -156,7 +156,8 @@ lazy val utilDependencies = Seq(
     exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
   greex,
   azureIdentity,
-  azureStorage)
+  azureStorage,
+  jsoup)
 
 lazy val typedDependencyParserDependencies = Seq(junit)

296 changes: 296 additions & 0 deletions examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb
@@ -0,0 +1,296 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tzcU5p2gdak9"
},
"source": [
"# Introducing the HTML Reader in Spark NLP\n",
"This notebook showcases the newly added `sparknlp.read().html()` method in Spark NLP, which parses HTML content from both local files and real-time URLs into a Spark DataFrame.\n",
"\n",
"**Key Features:**\n",
"- Ability to parse HTML from local directories and URLs.\n",
"- Versatile support for varied data ingestion scenarios."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RFOFhaEedalB"
},
"source": [
"## Setup and Initialization\n",
"Let's keep in mind a few things before we start 😊\n",
"\n",
"Support for reading HTML files was introduced in `Spark NLP 5.5.2`. Please make sure you have upgraded to the latest Spark NLP release."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Let's install and set up Spark NLP in Google Colab\n",
"- This part is pretty easy via our simple script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the local-files example, we will download a couple of HTML files from the Spark NLP GitHub repo:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ya8qZe00dalC",
"outputId": "4399cc35-31d4-459c-bee8-d7eeba3d40cd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-11-05 20:02:19-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/example-10k.html\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2456707 (2.3M) [text/plain]\n",
"Saving to: ‘html-files/example-10k.html’\n",
"\n",
"\r",
"example-10k.html 0%[ ] 0 --.-KB/s \r",
"example-10k.html 100%[===================>] 2.34M --.-KB/s in 0.01s \n",
"\n",
"2024-11-05 20:02:19 (157 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]\n",
"\n",
"--2024-11-05 20:02:20-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/fake-html.html\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 665 [text/plain]\n",
"Saving to: ‘html-files/fake-html.html’\n",
"\n",
"fake-html.html 100%[===================>] 665 --.-KB/s in 0s \n",
"\n",
"2024-11-05 20:02:20 (41.9 MB/s) - ‘html-files/fake-html.html’ saved [665/665]\n",
"\n"
]
}
],
"source": [
"!mkdir html-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-10k.html -P html-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/fake-html.html -P html-files"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EoFI66NAdalE"
},
"source": [
"## Parsing HTML from Local Files\n",
"Use the `html()` method to parse HTML content from local directories."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bAkMjJ1vdalE",
"outputId": "c4bb38d4-963d-465b-e222-604dc6b617aa"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+--------------------+--------------------+\n",
"| path| content| html|\n",
"+--------------------+--------------------+--------------------+\n",
"|file:/content/htm...|<!DOCTYPE html>\\n...|[{Title, 0, My Fi...|\n",
"|file:/content/htm...|<?xml version=\"1...|[{Title, 0, UNITE...|\n",
"+--------------------+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"import sparknlp\n",
"html_df = sparknlp.read().html(\"./html-files\")\n",
"\n",
"html_df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VQD2k4E5dalF"
},
"source": [
"## Parsing HTML from Real-Time URLs\n",
"Use the `html()` method to fetch and parse HTML content from a URL or a set of URLs in real time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MMTGmxLQdalG",
"outputId": "57e99213-0fc7-483c-b7c2-695552fc8d73"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+\n",
"| html|\n",
"+--------------------+\n",
"|[{Title, 0, Examp...|\n",
"+--------------------+\n",
"\n"
]
}
],
"source": [
"html_df = sparknlp.read().html(\"https://example.com/\")\n",
"html_df.select(\"html\").show()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-psYdzWodalG",
"outputId": "544cd7e3-93a6-465a-8b9a-52d487d63b21"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+--------------------+\n",
"| url| html|\n",
"+--------------------+--------------------+\n",
"|https://www.wikip...|[{Title, 0, Wikip...|\n",
"|https://example.com/|[{Title, 0, Examp...|\n",
"+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"htmls_df = sparknlp.read().html([\"https://www.wikipedia.org\", \"https://example.com/\"])\n",
"htmls_df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FrVKxdySz8pR"
},
"source": [
"### Configuration Parameters"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QOXXVx5e7Ri1"
},
"source": [
"You can customize the font size used to identify paragraphs that should be treated as titles. By default, it is set to 16. If your HTML files require a different value, you can adjust this parameter accordingly. The example below demonstrates how to modify this setting:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aNfN0fQC0Vzz",
"outputId": "0b849a86-2d59-4415-981a-dcd9a9f7a14a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|html |\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"|[{Title, 0, My First Heading, {pageNumber -> 1}}, {Title, 0, My Second Heading, {pageNumber -> 1}}, {NarrativeText, 0, My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves?, {pageNumber -> 1}}, {Title, 0, A Third Heading, {pageNumber -> 1}}, {Table, 0, Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2, {pageNumber -> 1}}]|\n",
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
"\n"
]
}
],
"source": [
"params = {\"titleFontSize\": \"12\"}\n",
"html_df = sparknlp.read(params).html(\"./html-files/fake-html.html\")\n",
"html_df.select(\"html\").show(truncate=False)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
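
In the notebook outputs above, the `html` column holds an array of element structs (element type, text content, and metadata such as pageNumber). A minimal post-processing sketch, assuming the DataFrame layout shown in those outputs, that flattens the array into one row per element with PySpark's explode:

import sparknlp
from pyspark.sql import functions as F

html_df = sparknlp.read().html("./html-files")

# Each row's `html` value is an array of parsed elements; explode yields one element per row.
elements_df = html_df.select("path", F.explode("html").alias("element"))
elements_df.show(truncate=80)

This keeps one parsed element (Title, NarrativeText, Table, and so on) per row, which is usually easier to feed into downstream Spark stages.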
3 changes: 3 additions & 0 deletions project/Dependencies.scala
@@ -134,5 +134,8 @@ object Dependencies {
   val llamaCppSilicon = "com.johnsnowlabs.nlp" %% "jsl-llamacpp-silicon" % llamaCppVersion
   val llamaCppAarch64 = "com.johnsnowlabs.nlp" %% "jsl-llamacpp-aarch64" % llamaCppVersion
 
+  val jsoupVersion = "1.18.1"
+  val jsoup = "org.jsoup" % "jsoup" % jsoupVersion
+
   /** ------- Dependencies end ------- */
 }
14 changes: 10 additions & 4 deletions python/sparknlp/__init__.py
@@ -12,17 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import sys
 import subprocess
+import sys
 import threading
 
+from pyspark.conf import SparkConf
+from pyspark.context import SparkContext
+from pyspark.java_gateway import launch_gateway
+from pyspark.sql import SparkSession
+
 from sparknlp import annotator
 # Must be declared here one by one or else PretrainedPipeline will fail with AttributeError
 from sparknlp.base import DocumentAssembler, MultiDocumentAssembler, Finisher, EmbeddingsFinisher, TokenAssembler, \
     Doc2Chunk, AudioAssembler, GraphFinisher, ImageAssembler, TableAssembler
-from pyspark.conf import SparkConf
-from pyspark.context import SparkContext
-from pyspark.java_gateway import launch_gateway
+from sparknlp.reader import SparkNLPReader
 
 sys.modules['com.johnsnowlabs.nlp.annotators'] = annotator
 sys.modules['com.johnsnsowlabs.nlp.annotators.tokenizer'] = annotator
@@ -301,6 +304,9 @@ def shutdown(self):
         spark_session = start_without_realtime_output()
     return spark_session
 
+def read(params=None):
+    spark_session = start()
+    return SparkNLPReader(spark_session, params)
 
 def version():
     """Returns the current Spark NLP version.
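
The new module-level read() helper starts (or reuses) a Spark session and wraps it in a SparkNLPReader. A minimal usage sketch, reusing the html-files directory and the titleFontSize parameter from the notebook above:

import sparknlp

# The optional params dict is forwarded to SparkNLPReader.
html_df = sparknlp.read({"titleFontSize": "12"}).html("./html-files")
html_df.select("html").show(truncate=False)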
15 changes: 15 additions & 0 deletions python/sparknlp/reader/__init__.py
@@ -0,0 +1,15 @@
# Copyright 2017-2022 John Snow Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module for reading different file types."""
from sparknlp.reader.sparknlp_reader import *
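
Since the module re-exports SparkNLPReader, the reader can also be instantiated directly against an existing session. A sketch under the assumption that the constructor matches the SparkNLPReader(spark_session, params) call used by read() above:

import sparknlp
from sparknlp.reader import SparkNLPReader

spark = sparknlp.start()

# Equivalent to sparknlp.read(params), but with an explicitly managed session.
reader = SparkNLPReader(spark, {"titleFontSize": "16"})
urls_df = reader.html(["https://www.wikipedia.org", "https://example.com/"])
urls_df.show()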