Merge branch 'main' of https://github.com/apache/incubator-wayang-web…

Showing 4 changed files with 128 additions and 2 deletions.

---
slug: kafka-meets-wayang-3
title: Apache Kafka meets Apache Wayang - Part 3
authors: kamir
tags: [wayang, kafka, spark, cross organization data collaboration]
---

The third part of this article series is an activity log.
Motivated by the learnings from last time, I started implementing a Kafka Source component and a Kafka Sink component for the Apache Spark platform in Apache Wayang.
In our previous article we shared the results of the work on the first Apache Kafka integration, using the Java platform.

Let's see how it goes this time with Apache Spark.

## The goal of this implementation

We want to process data from Apache Kafka topics hosted on Confluent Cloud.
In our example scenario, the data is available in multiple clusters, in different regions, and owned by different organizations.

We assume that the operator of our job has been granted the appropriate permissions, and that the topic owners have already provided the configuration properties, including access coordinates and credentials.
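
Such a client configuration typically looks like the following sketch. The property keys are standard Apache Kafka client settings; all endpoint and credential values are placeholders for what a topic owner would provide, not real coordinates.

```java
import java.util.Properties;

// Hypothetical client configuration for topics hosted on Confluent Cloud.
// All values are placeholders provided by the topic owner.
Properties props = new Properties();
props.put("bootstrap.servers", "<cluster-endpoint>.confluent.cloud:9092");
props.put("security.protocol", "SASL_SSL");
props.put("sasl.mechanism", "PLAIN");
props.put("sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
```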

![images/image-1.png](images/image-1.png)

This illustration was already introduced in part one.
We focus on **Job 4** in the image and start to implement it.
This time we expect the processing load to be higher, so we want to utilize the scalability of Apache Spark.

Again, we start with a **WayangContext**, as shown by the examples in the Wayang code repository.

```java
WayangContext wayangContext = new WayangContext().with(Spark.basicPlugin());
```

We simply switch the backend to Apache Spark by creating the _WayangContext_ with _Spark.basicPlugin()_.
The **JavaPlanBuilder** and all other logic of our example job remain untouched.

To make this work, we now implement the Mappings and the Operators for the Apache Spark platform module.

## Implementation of Input and Output Operators

We reuse the Kafka Source and Kafka Sink components that were created for the JavaKafkaSource and JavaKafkaSink.
Hence, we work with Wayang's Java API.

**Level 1 – Wayang execution plan with abstract operators**

Since the _JavaPlanBuilder_ already exposes the function for selecting a Kafka topic as a source,
and the _DataQuantaBuilder_ class exposes the _writeKafkaTopic_ function, we can move on quickly.
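
To illustrate, a minimal plan could look like the sketch below. The source-side function name _readKafkaTopic_ and the simplified builder signatures are assumptions for illustration; only _writeKafkaTopic_ is named explicitly above.

```java
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.spark.Spark;

// Sketch of the example job; the Kafka-related method names and signatures are assumed.
WayangContext wayangContext = new WayangContext().with(Spark.basicPlugin());
JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
        .withJobName("kafka-to-kafka-demo");

planBuilder
        .readKafkaTopic("topic_in")        // assumed source function, mirroring writeKafkaTopic
        .map(String::toUpperCase)          // any record-level transformation
        .writeKafkaTopic("topic_out");     // sink function mentioned above
```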

Remember, in this API layer we use the Scala programming language, but we utilize the Java classes implemented in the layer below.
**Level 2 – Wiring between Platform Abstraction and Implementation** | ||
|
||
As in the case with the Java Platform, in the second layer we build a bridge between the WayangContext and the PlanBuilders, which work together with DataQuanta and the DataQuantaBuilder. | ||
|
||
We must provide the mapping between the abstract components and the specific implementations in this layer. | ||
|
||
Therefore, the mappings package in project **wayang-platforms/wayang-spark** has a class _Mappings_ in which | ||
our _KafkaTopicSinkMapping_ and _KafkaTopicSourceMapping_ will be registered. | ||
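
A minimal sketch of such a registration, assuming the _Mappings_ class follows the common pattern of exposing a static collection of mappings (names and package locations may differ in the actual code):

```java
import java.util.Arrays;
import java.util.Collection;
import org.apache.wayang.core.mapping.Mapping;

// Sketch: the Spark platform's Mappings class collects the operator mappings,
// including the two new Kafka mappings, so that the optimizer can apply them.
public class Mappings {

    public static final Collection<Mapping> KAFKA_MAPPINGS = Arrays.asList(
            new KafkaTopicSourceMapping(),
            new KafkaTopicSinkMapping()
    );
}
```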

Again, these classes allow the Apache Wayang framework to use the Java implementation of the KafkaTopicSource component (and of the KafkaTopicSink, respectively).

While the Wayang execution plan uses the higher abstractions, here on the “platform level” we have to link the specific implementation for the target platform.
In this case this leads to an Apache Spark job, running on a Spark cluster that is set up by the Apache Wayang framework using the logical components of the execution plan and the Apache Spark configuration provided at runtime.

A mapping links an operator implementation to the abstraction used in an execution plan.
We define two new mappings for our purpose, namely _KafkaTopicSourceMapping_ and _KafkaTopicSinkMapping_; both could be reused from last round.
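
As a rough sketch, such a mapping could look as follows. This is modeled on the existing Spark file-source mappings; the exact pattern-matching API, package names, and constructor arguments are assumptions rather than the authoritative implementation.

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.wayang.basic.operators.KafkaTopicSource;
import org.apache.wayang.core.mapping.Mapping;
import org.apache.wayang.core.mapping.OperatorPattern;
import org.apache.wayang.core.mapping.PlanTransformation;
import org.apache.wayang.core.mapping.ReplacementSubplanFactory;
import org.apache.wayang.core.mapping.SubplanPattern;
import org.apache.wayang.spark.platform.SparkPlatform;

// Sketch: replace the abstract KafkaTopicSource in the execution plan with the
// Spark-specific operator; note SparkPlatform where the Java mapping used JavaPlatform.
public class KafkaTopicSourceMapping implements Mapping {

    @Override
    public Collection<PlanTransformation> getTransformations() {
        return Collections.singleton(new PlanTransformation(
                this.createSubplanPattern(),
                this.createReplacementSubplanFactory(),
                SparkPlatform.getInstance()
        ));
    }

    private SubplanPattern createSubplanPattern() {
        // Match a single abstract KafkaTopicSource operator in the plan.
        final OperatorPattern<KafkaTopicSource> operatorPattern =
                new OperatorPattern<>("source", new KafkaTopicSource("input_topic"), false);
        return SubplanPattern.createSingleton(operatorPattern);
    }

    private ReplacementSubplanFactory createReplacementSubplanFactory() {
        // Swap in the Spark implementation for the matched abstract operator.
        return new ReplacementSubplanFactory.OfSingleOperators<KafkaTopicSource>(
                (matchedOperator, epoch) -> new SparkKafkaTopicSource(matchedOperator).at(epoch)
        );
    }
}
```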

For the Spark platform we simply replace the occurrences of _JavaPlatform_ with _SparkPlatform_.

Furthermore, we create implementations of the _SparkKafkaTopicSource_ and _SparkKafkaTopicSink_.

**Level 3 – Input/Output Connector Layer**

Let's quickly recap: how does Apache Spark interact with Apache Kafka?

There is already an integration that gives us a Dataset using the Spark SQL framework.
For Spark Structured Streaming, there is also a Kafka integration using the _SparkSession_'s _readStream()_ function.
Kafka client properties are provided as key-value pairs _k_ and _v_ by using the _option(k, v)_ function.
For writing into a topic, we can use the _writeStream()_ function.
At first glance, however, this does not seem to be the best fit.
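
For reference, the Structured Streaming route looks roughly like this; the bootstrap servers and topic names are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("kafka-structured-streaming").getOrCreate();

// Read records from a Kafka topic as a streaming Dataset.
Dataset<Row> input = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")  // placeholder
        .option("subscribe", "topic_in")                           // placeholder topic
        .load();

// Write the records back into another topic; the Kafka sink expects string or binary key/value columns.
input.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "<bootstrap-servers>")
        .option("topic", "topic_out")
        .option("checkpointLocation", "/tmp/checkpoints")          // required by the Kafka sink
        .start();
```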

Another approach is possible.
We can use plain RDDs to process data previously consumed from Apache Kafka.
This is a lower-level approach compared to using Datasets with Spark Structured Streaming, and it typically involves the Kafka RDD API provided by Spark.

This approach is less common with newer versions of Spark, as Structured Streaming provides a higher-level abstraction that simplifies stream processing.
However, we might need this approach for the integration with Apache Wayang.

For now, we will focus on the lower-level approach: we consume data from Kafka using a Kafka client and then parallelize the records into an RDD.
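
A minimal sketch of this idea, independent of Wayang's operator classes; the topic name, the properties, and the single poll are simplifications for illustration:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Consume a batch of records with a plain Kafka client ...
Properties props = new Properties();   // filled with the owner-provided configuration
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topic_in"));

List<String> values = new ArrayList<>();
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
for (ConsumerRecord<String, String> record : records) {
    values.add(record.value());
}
consumer.close();

// ... and hand them to Spark as an RDD for parallel processing.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "kafka-rdd-sketch");
JavaRDD<String> rdd = sparkContext.parallelize(values);
```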

This allows us to reuse the _KafkaTopicSource_ and _KafkaTopicSink_ classes we built last time.
Those were made specifically for a simple, non-parallel Java program, using one Consumer and one Producer.

The selected approach does not yet take full advantage of Spark's parallelism at load time.
For higher loads, and especially for stream processing, we would have to investigate another approach using a _SparkStreamingContext_, but this is out of scope for now.

Since we can't reuse the _JavaKafkaTopicSource_ and _JavaKafkaTopicSink_, we instead implement _SparkKafkaTopicSource_ and _SparkKafkaTopicSink_ based on the existing _SparkTextFileSource_ and _SparkTextFileSink_, which both carry all the needed RDD-specific logic.

## Summary

As expected, the integration of Apache Spark with Apache Wayang was no magic, thanks to the fluent API design and the well-structured architecture of Apache Wayang.
We could easily follow the pattern we had worked out in the previous exercise.

But a bunch of much more interesting work will follow next.
More testing, more serialization schemes, and Kafka Schema Registry support should follow, as well as full parallelization.

The code has been submitted to the Apache Wayang repository.

## Outlook

The next part of this article series will cover the real-world example as described in image 1.
We will show how analysts and developers can use the Apache Kafka integration for Apache Wayang to solve cross-organizational collaboration issues.
To that end, we will bring all the puzzle pieces together and show the full implementation of the multi-organizational data collaboration use case.

The commit also updates the website configuration, adding a LinkedIn entry next to the existing Mailing list and Reddit links:

```
@@ -157,6 +157,10 @@ const config: Config = {
          label: 'Mailing list',
          href: 'https://lists.apache.org/[email protected]',
        },
        {
          label: 'LinkedIn',
          href: 'https://www.linkedin.com/company/apachewayang',
        },
        {
          label: 'Reddit',
          href: 'https://www.reddit.com/r/ApacheWayang',
```