diff --git a/docs/content.zh/docs/deployment/advanced/job_status_listener.md b/docs/content.zh/docs/deployment/advanced/job_status_listener.md
new file mode 100644
index 0000000000000..251234c6ee0f4
--- /dev/null
+++ b/docs/content.zh/docs/deployment/advanced/job_status_listener.md
@@ -0,0 +1,80 @@
+
+---
+title: "Job Status Changed Listener"
+nav-title: job-status-listener
+nav-parent_id: advanced
+nav-pos: 5
+---
+
+
+## Job Status Changed Listener
+Flink provides a pluggable interface for users to register custom logic that handles job status changes, and it exposes lineage information about sources and sinks. This lets users implement their own Flink lineage reporter that sends lineage information to third-party data lineage systems such as DataHub and OpenLineage.
+
+Job status changed listeners are triggered every time the status of the application changes. The data lineage information is included in the JobCreatedEvent.
+
+### Implement a plugin for the job status changed listener
+
+To implement a custom JobStatusChangedListener plugin, you need to:
+
+- Add your own JobStatusChangedListener by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListener.java" name="JobStatusChangedListener" >}} interface.
+
+- Add your own JobStatusChangedListenerFactory by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListenerFactory.java" name="JobStatusChangedListenerFactory" >}} interface.
+
+- Add a Java service entry. Create a file `META-INF/services/org.apache.flink.core.execution.JobStatusChangedListenerFactory` that contains the class name of your job status changed listener factory (see the [Java Service Loader](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html) documentation for more details).
+
+
+Then, create a jar that includes your `JobStatusChangedListener`, `JobStatusChangedListenerFactory`, `META-INF/services/`, and all external dependencies.
+Make a directory in `plugins/` of your Flink distribution with an arbitrary name, e.g. "job-status-changed-listener", and put the jar into this directory.
+See [Flink Plugin]({{< ref "docs/deployment/filesystems/plugins" >}}) for more details.
+
+JobStatusChangedListenerFactory example:
+
+``` java
+package org.apache.flink.test.execution;
+
+import org.apache.flink.core.execution.JobStatusChangedListener;
+import org.apache.flink.core.execution.JobStatusChangedListenerFactory;
+
+public class TestingJobStatusChangedListenerFactory
+        implements JobStatusChangedListenerFactory {
+
+    // Called by Flink to create the listener for this plugin.
+    @Override
+    public JobStatusChangedListener createListener(Context context) {
+        return new TestingJobStatusChangedListener();
+    }
+}
+```
+
+JobStatusChangedListener example:
+
+``` java
+package org.apache.flink.test.execution;
+
+import org.apache.flink.core.execution.JobStatusChangedEvent;
+import org.apache.flink.core.execution.JobStatusChangedListener;
+import java.util.ArrayList;
+import java.util.List;
+
+public class TestingJobStatusChangedListener implements JobStatusChangedListener {
+
+    // Events received so far, kept in arrival order.
+    private final List<JobStatusChangedEvent> statusChangedEvents = new ArrayList<>();
+
+    @Override
+    public void onEvent(JobStatusChangedEvent event) {
+        statusChangedEvents.add(event);
+    }
+}
+```
+
+### Configuration
+
+Flink components load JobStatusChangedListener plugins at startup. To make sure that all implementations of JobStatusChangedListener are loaded, define all of their factory class names in [execution.job-status-changed-listeners]({{< ref "docs/deployment/config#execution.job-status-changed-listeners" >}}).
+If this configuration is empty, no listeners are started. For example:
+```
+    execution.job-status-changed-listeners = org.apache.flink.test.execution.TestingJobStatusChangedListenerFactory
+```
diff --git a/docs/content.zh/docs/internals/data_lineage.md b/docs/content.zh/docs/internals/data_lineage.md
new file mode 100644
index 0000000000000..82273ed60c60e
--- /dev/null
+++ b/docs/content.zh/docs/internals/data_lineage.md
@@ -0,0 +1,71 @@
+---
+title: Data Lineage
+weight: 12
+type: docs
+aliases:
+  - /zh/internals/data_lineage.html
+---
+
+
+# Native Lineage Support
+Data lineage is becoming increasingly important in the data ecosystem. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need an end-to-end lineage solution for scenarios including but not limited to:
+ - `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
+ - `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
+ - `Data Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
+ - `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.
+
+To meet the community's needs, Apache Flink provides native lineage support through an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that developers can use to integrate lineage metadata into external systems such as [OpenLineage](https://openlineage.io).
+When a job is created in the Flink runtime, a JobCreatedEvent containing the lineage graph metadata is sent to the job status listeners.
+
+# Lineage Data Model
+Flink's native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines the extended interfaces for Table and DataStream separately. The interface and class relationships are shown in the diagram below.
+
+{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
+
+By default, Table-related lineage interfaces and classes are mainly used in the Flink Table runtime, so Flink users do not need to touch these interfaces. The Flink community will gradually support all
+of the common connectors, such as Kafka, JDBC, Cassandra, and Hive. If you define a custom connector, its custom source/sink needs to implement the LineageVertexProvider interface.
+Within a LineageVertex, a list of Lineage Datasets is defined as metadata for the Flink source/sink.
+
+
+```java
+@PublicEvolving
+public interface LineageVertexProvider {
+    LineageVertex getLineageVertex();
+}
+```
+
+For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
+
+# Naming Conventions
+For each Lineage Dataset, we need to define its own name and namespace to distinguish the different data stores and corresponding instances used by the connectors of a Flink application.
+
+| Data Store | Connector Type  | Namespace                              | Name                                                     |
+|------------|-----------------|----------------------------------------|----------------------------------------------------------|
+| Kafka      | Kafka Connector | kafka://{bootstrap server host}:{port} | topic                                                    |
+| MySQL      | JDBC Connector  | mysql://{host}:{port}                  | {database}.{table}                                       |
+| SQL Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                       |
+| Postgres   | JDBC Connector  | postgres://{host}:{port}               | {database}.{schema}.{table}                              |
+| Oracle     | JDBC Connector  | oracle://{host}:{port}                 | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
+| Trino      | JDBC Connector  | trino://{host}:{port}                  | {catalog}.{schema}.{table}                               |
+| OceanBase  | JDBC Connector  | oceanbase://{host}:{port}              | {database}.{table}                                       |
+| DB2        | JDBC Connector  | db2://{host}:{port}                    | {database}.{table}                                       |
+| CrateDB    | JDBC Connector  | cratedb://{host}:{port}                | {database}.{table}                                       |
+
+If you would like to contribute lineage integration for a Flink connector that is not listed here, please complete the development in that connector's repository and then update the table above.
diff --git a/docs/content/docs/deployment/advanced/job_status_listener.md b/docs/content/docs/deployment/advanced/job_status_listener.md
index 723cc86259468..cc868e3f93d05 100644
--- a/docs/content/docs/deployment/advanced/job_status_listener.md
+++ b/docs/content/docs/deployment/advanced/job_status_listener.md
@@ -28,7 +28,7 @@ This enables users to implement their own flink lineage reporter to send lineage
 
 The job status changed listeners are triggered every time status change happened for the application. The data lineage info is included in the JobCreatedEvent.
 
-### Implement a plugin for your custom enricher
+### Implement a plugin for the job status changed listener
 
 To implement a custom JobStatusChangedListener plugin, you need to:
 
diff --git a/docs/content/docs/internals/data_lineage.md b/docs/content/docs/internals/data_lineage.md
new file mode 100644
index 0000000000000..930b37612edc2
--- /dev/null
+++ b/docs/content/docs/internals/data_lineage.md
@@ -0,0 +1,74 @@
+---
+title: Data Lineage
+weight: 12
+type: docs
+aliases:
+  - /internals/data_lineage.html
+---
+
+
+# Native Lineage Support
+As organisations look to govern their data ecosystems, understanding data lineage (where data comes from and where it goes) becomes critical. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need
+an end-to-end lineage solution for scenarios including but not limited to:
+ - `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
+ - `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
+ - `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
+ - `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.
+
+Apache Flink provides native lineage support through an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that
+developers can use to integrate lineage metadata into external lineage systems such as [OpenLineage](https://openlineage.io). When a job is created in the Flink runtime, a JobCreatedEvent
+containing the lineage graph metadata is sent to the job status listeners.
+
+# Lineage Data Model
+Flink's native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines
+the extended interfaces for Table and DataStream independently. The interface and class relationships are shown in the diagram below.
+
+{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}
+
+By default, Table-related lineage interfaces and classes are used in the Flink Table environment, so Flink users do not need to touch these interfaces. The Flink community will gradually support all
+of the common connectors, such as Kafka, JDBC, Cassandra, and Hive. If you define a custom connector, its custom source/sink needs to implement the LineageVertexProvider interface.
+Within a LineageVertex, a list of Lineage Datasets is defined as metadata for the Flink source/sink.
+
+
+```java
+@PublicEvolving
+public interface LineageVertexProvider {
+    LineageVertex getLineageVertex();
+}
+```
+
+For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).
+
+# Naming Conventions
+For each Lineage Dataset, we need to define its name and namespace to distinguish the different data stores and the corresponding dynamic tables associated with a Flink connector.
+
+| Data Store | Connector Type  | Namespace                              | Name                                                     |
+|------------|-----------------|----------------------------------------|----------------------------------------------------------|
+| Kafka      | Kafka Connector | kafka://{bootstrap server host}:{port} | topic                                                    |
+| MySQL      | JDBC Connector  | mysql://{host}:{port}                  | {database}.{table}                                       |
+| SQL Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                       |
+| Postgres   | JDBC Connector  | postgres://{host}:{port}               | {database}.{schema}.{table}                              |
+| Oracle     | JDBC Connector  | oracle://{host}:{port}                 | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
+| Trino      | JDBC Connector  | trino://{host}:{port}                  | {catalog}.{schema}.{table}                               |
+| OceanBase  | JDBC Connector  | oceanbase://{host}:{port}              | {database}.{table}                                       |
+| DB2        | JDBC Connector  | db2://{host}:{port}                    | {database}.{table}                                       |
+| CrateDB    | JDBC Connector  | cratedb://{host}:{port}                | {database}.{table}                                       |
+
+If you would like to contribute lineage integration for a Flink connector that is not listed here, please complete the development in that connector's repository and then update the table above.
diff --git a/docs/static/fig/lineage_interfaces.png b/docs/static/fig/lineage_interfaces.png
new file mode 100644
index 0000000000000..40718118d8009
Binary files /dev/null and b/docs/static/fig/lineage_interfaces.png differ
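
Reviewer note on the Naming Conventions table in this patch: the namespace and name formats could be assembled by a connector author roughly as in the sketch below. `LineageNaming` and its three methods are hypothetical helpers introduced only for illustration and are not part of Flink's API. Only the string formats are taken from the table, and the host names in `main` are made up.

```java
/**
 * Hypothetical helper (not a Flink class) that formats lineage dataset
 * identifiers following the Naming Conventions table in this patch.
 */
public final class LineageNaming {

    private LineageNaming() {}

    /** Kafka namespace: kafka://{bootstrap server host}:{port}. */
    public static String kafkaNamespace(String host, int port) {
        return "kafka://" + host + ":" + port;
    }

    /** JDBC namespace, e.g. mysql://{host}:{port}; the scheme is the one listed in the table. */
    public static String jdbcNamespace(String scheme, String host, int port) {
        return scheme + "://" + host + ":" + port;
    }

    /** Dataset name joined with dots, e.g. {database}.{table} or {database}.{schema}.{table}. */
    public static String datasetName(String... parts) {
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        // Kafka: the name is just the topic.
        System.out.println(kafkaNamespace("broker-1", 9092) + " " + datasetName("orders"));
        // Postgres: the name is {database}.{schema}.{table}.
        System.out.println(
                jdbcNamespace("postgres", "db-host", 5432)
                        + " "
                        + datasetName("shop", "public", "orders"));
    }
}
```

Keeping the formatting in one place like this is a design choice of the sketch, not something the patch prescribes. A custom connector would return these strings from whatever Lineage Dataset implementation it provides through LineageVertexProvider.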