Flesh out Apache Spark Examples documentation #5160
+589
−2
This PR adds example documentation on key components of the Apache Spark ecosystem:
3-spark-streaming (with Kafka & docker-compose)
Spark Streaming lets developers apply their existing Spark skills to real-time data streams, enabling powerful and scalable streaming applications. Kafka is widely considered the de-facto data source for Spark streaming. This example demonstrates a simple Spark streaming service backed by Kafka, and uses custom Mill tasks to define functions that run administrative Kafka commands.
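As a rough illustration of what such a custom Mill task might look like (this is a sketch, not the PR's actual code — the module name, topic name, docker-compose service name, and broker address are all assumptions), an administrative Kafka command can be wrapped in a Mill command that shells out to the broker container:

```scala
// build.mill — hypothetical fragment, assuming a docker-compose
// service named `kafka` and a topic named `events`.
package build
import mill._

object spark extends Module {
  // Create the topic the streaming job reads from by exec-ing
  // kafka-topics.sh inside the running `kafka` container.
  def createTopic() = Task.Command {
    os.call(
      cmd = Seq(
        "docker", "compose", "exec", "kafka",
        "kafka-topics.sh", "--create",
        "--topic", "events",
        "--bootstrap-server", "localhost:9092"
      ),
      stdout = os.Inherit
    )
  }
}
```

Running `./mill spark.createTopic` would then perform the admin command without leaving the build tool.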
4-hello-delta
5-hello-iceberg
Delta Lake and Apache Iceberg are open-source storage layers that bring features such as ACID transactions and schema evolution to data lakes used with Spark, making them more reliable and manageable for large-scale analytics. Delta Lake has very high adoption and is the default within the Databricks ecosystem, while Apache Iceberg's adoption is growing rapidly. These two are the most commonly used table formats in real-world practice.
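To give a feel for the Delta Lake side, here is a minimal sketch of writing and reading a Delta table from Scala Spark (assuming the `delta-spark` artifact is on the classpath; the output path and column names are illustrative, not taken from the PR — the Iceberg example follows the same write/read pattern with Iceberg's catalog configuration instead):

```scala
import org.apache.spark.sql.SparkSession

object HelloDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-delta")
      .master("local[*]")
      // Register Delta's SQL extensions and catalog implementation.
      .config("spark.sql.extensions",
              "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
    import spark.implicits._

    // Write a tiny DataFrame as a Delta table (ACID, versioned).
    Seq((1, "alice"), (2, "bob")).toDF("id", "name")
      .write.format("delta").mode("overwrite").save("/tmp/hello-delta")

    // Read it back; earlier versions remain reachable via
    // .option("versionAsOf", 0) time travel.
    spark.read.format("delta").load("/tmp/hello-delta").show()
    spark.stop()
  }
}
```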
6-hello-mllib
7-hello-pyspark-mllib
Spark MLlib, Spark's scalable machine learning library, is designed to run efficiently on Spark's distributed computing framework and provides tools for common machine learning tasks such as regression and classification. Examples are provided in both Python and Scala.
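The shape of such an example, in Scala, is roughly the following sketch (the toy dataset and column names are made up for illustration and are not the PR's code): features are assembled into a vector column, an estimator is fit, and the resulting model transforms a DataFrame.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object HelloMllib {
  // Train a logistic regression on toy data and return its predictions.
  def predictions(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Toy training data: two numeric features and a binary label.
    val raw = Seq(
      (0.0, 1.0, 0.0),
      (1.0, 0.0, 1.0),
      (0.5, 1.5, 0.0),
      (1.5, 0.5, 1.0)
    ).toDF("f1", "f2", "label")

    // MLlib estimators expect features packed into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val training = assembler.transform(raw)

    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("label", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hello-mllib").master("local[*]").getOrCreate()
    predictions(spark).show()
    spark.stop()
  }
}
```

The PySpark variant mirrors this structure with `pyspark.ml` classes of the same names.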
Resolves issue #4592