[FLINK-35745] add documentation for flink lineage

apache · Dec 8, 2024 · 52bf157 · 52bf157
1 parent cbe2656
commit 52bf157

Showing 5 changed files with 226 additions and 1 deletion.
diff --git a/docs/content.zh/docs/deployment/advanced/job_status_listener.md b/docs/content.zh/docs/deployment/advanced/job_status_listener.md
@@ -0,0 +1,80 @@
+
+---
+title: "Job Status Changed Listener"
+nav-title: job-status-listener
+nav-parent_id: advanced
+nav-pos: 5
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
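+## Job status changed listener
+Flink provides a pluggable interface for users to register custom logic that handles job status changes, and these events carry lineage information about the job's sources and sinks. This enables users to implement their own Flink lineage reporter that sends lineage info to third-party data lineage systems, for example DataHub and OpenLineage.
+
+The job status changed listeners are triggered every time the status of the application changes. The data lineage info is included in the JobCreatedEvent.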
+
+### Implement a plugin for the job status changed listener
+
+To implement a custom JobStatusChangedListener plugin, you need to:
+
+- Add your own JobStatusChangedListener by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListener.java" name="JobStatusChangedListener" >}} interface.
+
+- Add your own JobStatusChangedListenerFactory by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListenerFactory.java" name="JobStatusChangedListenerFactory" >}} interface.
+
+- Add a Java service entry. Create a file `META-INF/services/org.apache.flink.core.execution.JobStatusChangedListenerFactory` that contains the class name of your job status changed listener factory class (see the [Java Service Loader](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html) docs for more details).
+
+
+Then, create a jar that includes your `JobStatusChangedListener`, `JobStatusChangedListenerFactory`, `META-INF/services/` and all external dependencies.
+Make a directory in `plugins/` of your Flink distribution with an arbitrary name, e.g. "job-status-changed-listener", and put the jar into this directory.
+See [Flink Plugin]({{< ref "docs/deployment/filesystems/plugins" >}}) for more details.
+
+JobStatusChangedListenerFactory example:
+
+``` java
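+package org.apache.flink.test.execution;
+
+import org.apache.flink.core.execution.JobStatusChangedListener;
+import org.apache.flink.core.execution.JobStatusChangedListenerFactory;
+
+/** Factory named in the Java service entry; creates one listener per call. */
+public class TestingJobStatusChangedListenerFactory
+        implements JobStatusChangedListenerFactory {
+
+    @Override
+    public JobStatusChangedListener createListener(Context context) {
+        return new TestingJobStatusChangedListener();
+    }
+}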
+```
+
+JobStatusChangedListener example:
+
+``` java
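+package org.apache.flink.test.execution;
+
+import org.apache.flink.core.execution.JobStatusChangedEvent;
+import org.apache.flink.core.execution.JobStatusChangedListener;
+
+import java.util.List;
+import java.util.concurrent.CopyOnWriteArrayList;
+
+public class TestingJobStatusChangedListener implements JobStatusChangedListener {
+
+    // Events received so far, kept in arrival order.
+    private final List<JobStatusChangedEvent> statusChangedEvents = new CopyOnWriteArrayList<>();
+
+    @Override
+    public void onEvent(JobStatusChangedEvent event) {
+        statusChangedEvents.add(event);
+    }
+}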
+```
+
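+### Configuration
+
+Flink components load JobStatusChangedListener plugins at startup. To make sure all implementations of JobStatusChangedListener are loaded, all class names should be defined in [execution.job-status-changed-listeners]({{< ref "docs/deployment/config#execution.job-status-changed-listeners" >}}).
+If this configuration is empty, no listeners are started. For example: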
+```
+    execution.job-status-changed-listeners = org.apache.flink.test.execution.TestingJobStatusChangedListenerFactory
+```
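Note on the listener page above: since the lineage info travels only in the JobCreatedEvent, a reporter will typically filter on that event type. The following is a minimal sketch of that filtering, not the Flink implementation. All types are declared locally as stand-ins so the snippet compiles without Flink. `lineageDatasets()`, `LineageReportingListener` and `ListenerSketch` are hypothetical names, and the real JobCreatedEvent carries a lineage graph object rather than a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins so the sketch compiles without Flink; the real types have richer signatures.
interface JobStatusChangedEvent {
    String jobName();
}

// Hypothetical simplification: the real JobCreatedEvent carries a lineage graph,
// reduced here to a plain list of dataset names.
interface JobCreatedEvent extends JobStatusChangedEvent {
    List<String> lineageDatasets();
}

interface JobStatusChangedListener {
    void onEvent(JobStatusChangedEvent event);
}

// Hypothetical reporter: records lineage only for job-created events.
class LineageReportingListener implements JobStatusChangedListener {
    private final List<String> reports = new ArrayList<>();

    @Override
    public void onEvent(JobStatusChangedEvent event) {
        if (event instanceof JobCreatedEvent) {
            JobCreatedEvent created = (JobCreatedEvent) event;
            // A real reporter would call the external lineage system here.
            reports.add(created.jobName() + " -> " + created.lineageDatasets());
        }
        // Other status changes carry no lineage and are ignored in this sketch.
    }

    List<String> reports() {
        return reports;
    }
}

class ListenerSketch {
    // Demo helper: builds a stand-in JobCreatedEvent.
    static JobCreatedEvent created(String job, List<String> datasets) {
        return new JobCreatedEvent() {
            @Override public String jobName() { return job; }
            @Override public List<String> lineageDatasets() { return datasets; }
        };
    }

    public static void main(String[] args) {
        LineageReportingListener listener = new LineageReportingListener();
        listener.onEvent(created("orders-etl", List.of("orders", "orders_clean")));
        listener.onEvent(() -> "orders-etl"); // a non-created status event: ignored
        listener.reports().forEach(System.out::println);
    }
}
```

In a real plugin the listener would implement the Flink interface from `org.apache.flink.core.execution` and be returned by the factory registered in the service entry.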
diff --git a/docs/content.zh/docs/internals/data_lineage.md b/docs/content.zh/docs/internals/data_lineage.md
@@ -0,0 +1,71 @@
+---
+title: Data Lineage
+weight: 12
+type: docs
+aliases:
+  - /zh/internals/data_lineage.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
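+# Native Lineage Support
+Data lineage has become increasingly critical in the data ecosystem. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need an end-to-end lineage solution for scenarios including but not limited to:
+  - `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
+  - `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
+  - `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
+  - `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.
+
+Apache Flink provides native lineage support to meet this community requirement. It offers an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that lets developers integrate lineage metadata into external systems, for example [OpenLineage](https://openlineage.io).
+When a job is created in the Flink runtime, a JobCreatedEvent that contains the lineage graph metadata is sent to the job status listeners.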
+
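+# Lineage Data Model
+Flink's native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines the extended interfaces for Table and DataStream independently. The relationships between the interfaces and classes are shown in the diagram below.
+
+{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
+
+By default, Table-related lineage interfaces and classes are mainly used in the Flink Table runtime, so Flink users don't need to touch these interfaces. The Flink community will gradually support all
+common connectors, such as Kafka, JDBC, Cassandra, Hive and so on. If you define a custom connector, its custom source/sink needs to implement the LineageVertexProvider interface.
+Within a LineageVertex, a list of lineage datasets is defined as metadata for the Flink source/sink.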
+
+
+```java
+@PublicEvolving
+public interface LineageVertexProvider {
+  LineageVertex getLineageVertex();
+}
+```
+
+For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
+
+# Naming Conventions
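+For each lineage dataset, we need to define its own name and namespace to distinguish the different data stores and corresponding instances used in the connectors of a Flink application.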
+
+| Data Store | Connector Type  | Namespace                              | Name                                                     | 
+|------------|-----------------|----------------------------------------|----------------------------------------------------------|
+| Kafka      | Kafka Connector | kafka://{bootstrap server host}:{port} | topic                                                    |
+| MySQL      | JDBC Connector  | mysql://{host}:{port}                  | {database}.{table}                                       | 
+| Sql Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                       | 
+| Postgres   | JDBC Connector  | postgres://{host}:{port}               | {database}.{schema}.{table}                              | 
+| Oracle     | JDBC Connector  | oracle://{host}:{port}                 | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} | 
+| Trino      | JDBC Connector  | trino://{host}:{port}                  | {catalog}.{schema}.{table}                               | 
+| OceanBase  | JDBC Connector  | oceanbase://{host}:{port}              | {database}.{table}                                       | 
+| DB2        | JDBC Connector  | db2://{host}:{port}                    | {database}.{table}                                       | 
+| CrateDB    | JDBC Connector  | cratedb://{host}:{port}                | {database}.{table}                                       | 
+
+This table is a work in progress. More naming info will be added as lineage integration is completed for each connector.
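Note on the naming table above: the two columns follow a regular pattern, `{scheme}://{host}:{port}` for the namespace and dot-joined identifiers for the name. A minimal sketch of that pattern follows. `LineageNaming` is a hypothetical helper that is not part of Flink or any connector, and the host names and identifiers in `main` are made up for illustration.

```java
// Hypothetical helper, not part of Flink: formats the Namespace and Name columns of the table.
final class LineageNaming {
    private LineageNaming() {}

    // Namespace column: {scheme}://{host}:{port}, e.g. mysql://{host}:{port}.
    static String namespace(String scheme, String host, int port) {
        return scheme + "://" + host + ":" + port;
    }

    // Name column: identifiers joined with dots, e.g. {database}.{schema}.{table} for Postgres.
    static String name(String... parts) {
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        // MySQL row: {database}.{table}
        System.out.println(namespace("mysql", "db-host", 3306) + " " + name("shop", "orders"));
        // Postgres row: {database}.{schema}.{table}
        System.out.println(namespace("postgres", "pg-host", 5432) + " " + name("shop", "public", "orders"));
        // Kafka row: the name is just the topic
        System.out.println(namespace("kafka", "broker-1", 9092) + " " + name("orders"));
    }
}
```

Keeping the formatting in one place makes it easier for two connectors that point at the same data store to produce identical dataset identifiers.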
diff --git a/docs/content/docs/deployment/advanced/job_status_listener.md b/docs/content/docs/deployment/advanced/job_status_listener.md
@@ -28,7 +28,7 @@ This enables users to implement their own flink lineage reporter to send lineage
 
 The job status changed listeners are triggered every time status change happened for the application. The data lineage info is included in the JobCreatedEvent.
 
-### Implement a plugin for your custom enricher
+### Implement a plugin for the job status changed listener
 
 To implement a custom JobStatusChangedListener plugin, you need to:
 

diff --git a/docs/content/docs/internals/data_lineage.md b/docs/content/docs/internals/data_lineage.md
@@ -0,0 +1,74 @@
+---
+title: Data Lineage
+weight: 12
+type: docs
+aliases:
+  - /internals/data_lineage.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Native Lineage Support
+Data lineage has become increasingly critical in the data ecosystem. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need
+an end-to-end lineage solution for scenarios including but not limited to:
+  - `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
+  - `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
+  - `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
+  - `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.
+
+Apache Flink provides native lineage support to meet this community requirement. It offers an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that lets
+developers integrate lineage metadata into external lineage systems, for example [OpenLineage](https://openlineage.io). When a job is created in the Flink runtime, a JobCreatedEvent
+that contains the lineage graph metadata is sent to the job status listeners.
+
+# Lineage Data Model
+Flink's native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines
+the extended interfaces for Table and DataStream independently. The relationships between the interfaces and classes are shown in the diagram below.
+
+{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
+
+By default, Table-related lineage interfaces and classes are mainly used in the Flink Table runtime, so Flink users don't need to touch these interfaces. The Flink community will gradually support all
+common connectors, such as Kafka, JDBC, Cassandra, Hive and so on. If you define a custom connector, its custom source/sink needs to implement the LineageVertexProvider interface.
+Within a LineageVertex, a list of lineage datasets is defined as metadata for the Flink source/sink.
+
+
+```java
+@PublicEvolving
+public interface LineageVertexProvider {
+  LineageVertex getLineageVertex();
+}
+```
+
+For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
+
+# Naming Conventions
+For each lineage dataset, we need to define its own name and namespace to distinguish the different data stores and corresponding instances used in the connectors of a Flink application.
+
+| Data Store | Connector Type  | Namespace                              | Name                                                     | 
+|------------|-----------------|----------------------------------------|----------------------------------------------------------|
+| Kafka      | Kafka Connector | kafka://{bootstrap server host}:{port} | topic                                                    |
+| MySQL      | JDBC Connector  | mysql://{host}:{port}                  | {database}.{table}                                       | 
+| Sql Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                       | 
+| Postgres   | JDBC Connector  | postgres://{host}:{port}               | {database}.{schema}.{table}                              | 
+| Oracle     | JDBC Connector  | oracle://{host}:{port}                 | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} | 
+| Trino      | JDBC Connector  | trino://{host}:{port}                  | {catalog}.{schema}.{table}                               | 
+| OceanBase  | JDBC Connector  | oceanbase://{host}:{port}              | {database}.{table}                                       | 
+| DB2        | JDBC Connector  | db2://{host}:{port}                    | {database}.{table}                                       | 
+| CrateDB    | JDBC Connector  | cratedb://{host}:{port}                | {database}.{table}                                       | 
+
+This table is a work in progress. More naming info will be added as lineage integration is completed for each connector.
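Note on the custom connector requirement above: a custom source or sink exposes its datasets by implementing LineageVertexProvider. The sketch below illustrates the shape of such an implementation under stated assumptions. The three interfaces are local stand-ins so the snippet compiles without Flink (the real LineageDataset exposes more than a name and namespace, which is omitted here), and `CustomTopicSource` is a hypothetical class that reuses the Kafka row of the naming table.

```java
import java.util.List;

// Stand-ins so the sketch compiles without Flink; a real connector imports Flink's interfaces.
interface LineageDataset {
    String name();
    String namespace();
}

interface LineageVertex {
    List<LineageDataset> datasets();
}

interface LineageVertexProvider {
    LineageVertex getLineageVertex();
}

// Hypothetical custom source that reads one topic from one broker.
class CustomTopicSource implements LineageVertexProvider {
    private final String host;
    private final int port;
    private final String topic;

    CustomTopicSource(String host, int port, String topic) {
        this.host = host;
        this.port = port;
        this.topic = topic;
    }

    @Override
    public LineageVertex getLineageVertex() {
        LineageDataset dataset = new LineageDataset() {
            @Override public String name() { return topic; }
            // Follows the Kafka row of the table: kafka://{bootstrap server host}:{port}
            @Override public String namespace() { return "kafka://" + host + ":" + port; }
        };
        // The vertex lists every dataset this source reads; here there is exactly one.
        return () -> List.of(dataset);
    }

    public static void main(String[] args) {
        LineageVertex vertex = new CustomTopicSource("broker-1", 9092, "orders").getLineageVertex();
        for (LineageDataset d : vertex.datasets()) {
            System.out.println(d.namespace() + " " + d.name());
        }
    }
}
```

A source that reads several topics or tables would return one dataset per topic or table in the same vertex.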
diff --git a/docs/static/fig/lineage_interfaces.png b/docs/static/fig/lineage_interfaces.png