Flesh out Apache Spark Examples documentation #5160

Open: monyedavid wants to merge 7 commits into main

monyedavid (Contributor) commented May 20, 2025

This PR adds example documentation for key components of the Apache Spark ecosystem:

  • 3-spark-streaming (with kafkaesque & docker-compose)
    Spark Streaming lets Spark developers apply their existing Spark skills to real-time data streams and build powerful, scalable streaming applications. Kafka is widely considered the de facto data source for Spark Streaming. This example demonstrates a simple Spark Streaming service backed by Kafka, and uses custom Mill tasks to define functions that run administrative Kafka commands.

  • 4-hello-delta

  • 5-hello-iceberg
    Delta Lake and Apache Iceberg are open-source storage layers that bring features such as ACID transactions and schema evolution to data lakes used with Spark, making them more reliable and manageable for large-scale analytics. Delta Lake has very high adoption and is considered the default within the Databricks community, while Apache Iceberg's adoption is growing rapidly. These two are the most commonly used storage layers in real-world practice.

  • 6-hello-mllib

  • 7-hello-pyspark-mllib
    Spark MLlib, Spark's scalable machine learning library, is designed to run efficiently on Spark's distributed computing framework and provides tools for common machine learning tasks such as regression and classification. Examples are provided in both Python and Scala.

Resolves issue #4592

lihaoyi (Member) commented May 28, 2025

At first glance the examples are reasonable. The next step would be fleshing out the English explanations to highlight the relevant parts of each example and explain why they are important and necessary for that particular example.

lihaoyi (Member) commented May 28, 2025

Perhaps one more thing is required at the PR level: a convincing explanation in the PR description of why these are the examples that are most important for Spark developers, and not the dozens of other possible examples we could come up with. I do not have a Spark background, so you'll need to tell me why this choice of examples is the most useful for Spark users in a way I can understand and be convinced by.
