[FLINK-35745] add documentation for flink lineage #25762

Open · wants to merge 2 commits into master

80 changes: 80 additions & 0 deletions docs/content.zh/docs/deployment/advanced/job_status_listener.md
@@ -0,0 +1,80 @@

---
title: "作业状态改变监听器"
nav-title: job-status-listener
nav-parent_id: advanced
nav-pos: 5
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

## Job Status Changed Listener
Flink provides a pluggable interface for users to register custom logic for handling job status changes, in which lineage information about sources/sinks is provided. This enables users to implement their own Flink lineage reporter that sends lineage information to third-party data lineage systems, for example DataHub and OpenLineage.

The job status changed listeners are triggered every time a status change happens for the application. The data lineage information is included in the JobCreatedEvent.

### Implement a plugin for the job status changed listener

To implement a custom JobStatusChangedListener plugin, you need to:

- Add your own JobStatusChangedListener by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListener.java" name="JobStatusChangedListener" >}} interface.

- Add your own JobStatusChangedListenerFactory by implementing the {{< gh_link file="/flink-core/src/main/java/org/apache/flink/core/execution/JobStatusChangedListenerFactory.java" name="JobStatusChangedListenerFactory" >}} interface.

- Add a Java service entry: create the file `META-INF/services/org.apache.flink.core.execution.JobStatusChangedListenerFactory` containing the class name of your job status changed listener factory (see the [Java Service Loader](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/ServiceLoader.html) docs for more details), as shown right after this list.
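
For the factory example later in this section, the service file would contain a single line with its fully-qualified class name:

```
org.apache.flink.test.execution.TestingJobStatusChangedListenerFactory
```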


Then, create a jar which includes your `JobStatusChangedListener`, `JobStatusChangedListenerFactory`, `META-INF/services/` and all external dependencies.
Create a directory with an arbitrary name, for example `job-status-changed-listener`, under the `plugins/` directory of your Flink distribution, and put the jar into this directory.
See [Flink Plugin]({{< ref "docs/deployment/filesystems/plugins" >}}) for more details.
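
The resulting layout would look like the following sketch (the jar name is illustrative):

```
plugins/
└── job-status-changed-listener/
    └── my-job-status-changed-listener.jar
```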

Example of a JobStatusChangedListenerFactory:

``` java
package org.apache.flink.test.execution;

import org.apache.flink.core.execution.JobStatusChangedListener;
import org.apache.flink.core.execution.JobStatusChangedListenerFactory;

/** Factory discovered through the Java service loader; it creates the listener instance. */
public class TestingJobStatusChangedListenerFactory
        implements JobStatusChangedListenerFactory {

    @Override
    public JobStatusChangedListener createListener(Context context) {
        return new TestingJobStatusChangedListener();
    }
}
```

Example of a JobStatusChangedListener:

``` java
package org.apache.flink.test.execution;

import org.apache.flink.core.execution.JobStatusChangedEvent;
import org.apache.flink.core.execution.JobStatusChangedListener;

import java.util.ArrayList;
import java.util.List;

/** Listener that collects every status change event; replace the body of onEvent with your own reporting logic. */
public class TestingJobStatusChangedListener implements JobStatusChangedListener {

    private final List<JobStatusChangedEvent> statusChangedEvents = new ArrayList<>();

    @Override
    public void onEvent(JobStatusChangedEvent event) {
        statusChangedEvents.add(event);
    }
}
```

### Configuration

Flink components load the JobStatusChangedListener plugins at startup. To make sure all implementations of JobStatusChangedListener are loaded, all of their factory class names should be defined in [execution.job-status-changed-listeners]({{< ref "docs/deployment/config#execution.job-status-changed-listeners" >}}).
If this configuration is empty, no listener will be started. For example:
```
execution.job-status-changed-listeners = org.apache.flink.test.execution.TestingJobStatusChangedListenerFactory
```
71 changes: 71 additions & 0 deletions docs/content.zh/docs/internals/data_lineage.md
@@ -0,0 +1,71 @@
---
title: Data Lineage
weight: 12
type: docs
aliases:
- /zh/internals/data_lineage.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Native Lineage Support
Data lineage is becoming increasingly important in the data ecosystem. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need an end-to-end lineage solution for scenarios including but not limited to:
- `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
- `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
- `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
- `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.

Apache Flink provides native lineage support for this community requirement through an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that developers can use to integrate lineage metadata into external systems, for example [OpenLineage](https://openlineage.io).
When a job is created in the Flink runtime, a JobCreatedEvent containing the lineage graph metadata is sent to the job status listeners.

# Lineage Data Model
Flink native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines the extended interfaces for Table and DataStream separately. The relationships between the interfaces and classes are shown in the diagram below.

{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}

By default, Table-related lineage interfaces and classes are mainly used in the Flink Table runtime, so Flink users do not need to touch these interfaces. The Flink community will gradually support all
common connectors, such as Kafka, JDBC, Cassandra, and Hive. If you define a custom connector, your custom source/sink needs to implement the LineageVertexProvider interface, as in the sketch after the interface below.
Within a LineageVertex, a list of lineage datasets is defined as the metadata of a Flink source/sink.


```java
@PublicEvolving
public interface LineageVertexProvider {
LineageVertex getLineageVertex();
}
```
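
As an illustration, here is a minimal sketch of a custom source exposing one lineage dataset. The class name, package, and dataset values are made up for this example, and the lineage interface package and method signatures (`name()`, `namespace()`, `facets()`, `datasets()`) are assumptions based on FLIP-314:

```java
package org.apache.flink.connector.custom; // hypothetical package

// The lineage interface package below is an assumption based on FLIP-314.
import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageDatasetFacet;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.api.lineage.LineageVertexProvider;

import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical custom source; only the lineage-related part is shown. */
public class MyCustomSource implements LineageVertexProvider {

    @Override
    public LineageVertex getLineageVertex() {
        // One dataset describing what this source reads; namespace and name
        // follow the naming conventions table below (values are illustrative).
        LineageDataset dataset =
                new LineageDataset() {
                    @Override
                    public String name() {
                        return "orders"; // e.g. a Kafka topic
                    }

                    @Override
                    public String namespace() {
                        return "kafka://broker-1:9092";
                    }

                    @Override
                    public Map<String, LineageDatasetFacet> facets() {
                        return Collections.emptyMap();
                    }
                };

        return new LineageVertex() {
            @Override
            public List<LineageDataset> datasets() {
                return Collections.singletonList(dataset);
            }
        };
    }
}
```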

For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).

# Naming Conventions
For each lineage dataset, we need to define its own name and namespace to distinguish the different data stores and the corresponding instances used by the connectors of a Flink application.

| Data Store | Connector Type | Namespace | Name |
|------------|-----------------|----------------------------------------|----------------------------------------------------------|
| Kafka | Kafka Connector | kafka://{bootstrap server host}:{port} | topic |
| MySQL | JDBC Connector | mysql://{host}:{port} | {database}.{table} |
| SQL Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                        |
| Postgres | JDBC Connector | postgres://{host}:{port} | {database}.{schema}.{table} |
| Oracle | JDBC Connector | oracle://{host}:{port} | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
| Trino | JDBC Connector | trino://{host}:{port} | {catalog}.{schema}.{table} |
| OceanBase | JDBC Connector | oceanbase://{host}:{port} | {database}.{table} |
| DB2 | JDBC Connector | db2://{host}:{port} | {database}.{table} |
| CrateDB | JDBC Connector | cratedb://{host}:{port} | {database}.{table} |

This table is a work in progress. More naming information will be added as lineage integration is completed for each connector.
@@ -28,7 +28,7 @@ This enables users to implement their own flink lineage reporter to send lineage

The job status changed listeners are triggered every time status change happened for the application. The data lineage info is included in the JobCreatedEvent.

-### Implement a plugin for your custom enricher
+### Implement a plugin for the job status changed listener

To implement a custom JobStatusChangedListener plugin, you need to:

74 changes: 74 additions & 0 deletions docs/content/docs/internals/data_lineage.md
@@ -0,0 +1,74 @@
---
title: Data Lineage
weight: 12
type: docs
aliases:
- /internals/data_lineage.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Native Lineage Support
Data lineage is becoming increasingly critical in the data ecosystem. As Apache Flink is widely used for data ingestion and ETL in streaming data lakes, we need
an end-to-end lineage solution for scenarios including but not limited to:
- `Data Quality Assurance`: Identifying and rectifying data inconsistencies by tracing data errors back to their origin within the data pipeline.
- `Data Governance`: Establishing clear data ownership and accountability by documenting data origins and transformations.
- `Regulatory Compliance`: Ensuring adherence to data privacy and compliance regulations by tracking data flow and transformations throughout its lifecycle.
- `Data Optimization`: Identifying redundant data processing steps and optimizing data flows to improve efficiency.

Apache Flink provides native lineage support for this community requirement through an internal lineage data model and a [Job Status Listener]({{< ref "docs/deployment/advanced/job_status_listener" >}}) that
developers can use to integrate lineage metadata into external lineage systems, for example [OpenLineage](https://openlineage.io). When a job is created in the Flink runtime, a JobCreatedEvent
containing the lineage graph metadata is sent to the job status listeners.

# Lineage Data Model
Flink native lineage interfaces are defined in two layers. The first layer is the generic interface for all Flink jobs and connectors, and the second layer defines
the extended interfaces for Table and DataStream separately. The relationships between the interfaces and classes are shown in the diagram below.

{{< img src="/fig/lineage_interfaces.png" alt="Lineage Data Model" width="80%">}}

By default, Table-related lineage interfaces and classes are mainly used in the Flink Table runtime, so Flink users do not need to touch these interfaces. The Flink community will gradually support all
common connectors, such as Kafka, JDBC, Cassandra, and Hive. If you have a custom connector, your custom source/sink needs to implement the LineageVertexProvider interface, as in the sketch after the interface below.
Within a LineageVertex, a list of lineage datasets is defined as the metadata of a Flink source/sink.


```java
@PublicEvolving
public interface LineageVertexProvider {
LineageVertex getLineageVertex();
}
```
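
Below is a minimal sketch of a custom source exposing a single lineage dataset. The class name, package, and dataset values are illustrative only, and the lineage interface package and method signatures (`name()`, `namespace()`, `facets()`, `datasets()`) are assumptions based on FLIP-314:

```java
package org.apache.flink.connector.custom; // hypothetical package

// The lineage interface package below is an assumption based on FLIP-314.
import org.apache.flink.streaming.api.lineage.LineageDataset;
import org.apache.flink.streaming.api.lineage.LineageDatasetFacet;
import org.apache.flink.streaming.api.lineage.LineageVertex;
import org.apache.flink.streaming.api.lineage.LineageVertexProvider;

import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical custom source; only the lineage-related part is shown. */
public class MyCustomSource implements LineageVertexProvider {

    @Override
    public LineageVertex getLineageVertex() {
        // One dataset describing what this source reads; namespace and name
        // follow the naming conventions table below (values are illustrative).
        LineageDataset dataset =
                new LineageDataset() {
                    @Override
                    public String name() {
                        return "orders"; // e.g. a Kafka topic
                    }

                    @Override
                    public String namespace() {
                        return "kafka://broker-1:9092";
                    }

                    @Override
                    public Map<String, LineageDatasetFacet> facets() {
                        return Collections.emptyMap();
                    }
                };

        return new LineageVertex() {
            @Override
            public List<LineageDataset> datasets() {
                return Collections.singletonList(dataset);
            }
        };
    }
}
```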

For the interface details, please refer to [FLIP-314](https://cwiki.apache.org/confluence/display/FLINK/FLIP-314%3A+Support+Customized+Job+Lineage+Listener).

# Naming Conventions
For each lineage dataset, we need to define its own name and namespace to distinguish the different data stores and the corresponding instances used by the connectors of a Flink application.

| Data Store | Connector Type | Namespace | Name |
|------------|-----------------|----------------------------------------|----------------------------------------------------------|
| Kafka | Kafka Connector | kafka://{bootstrap server host}:{port} | topic |
| MySQL | JDBC Connector | mysql://{host}:{port} | {database}.{table} |
| SQL Server | JDBC Connector  | sqlserver://{host}:{port}              | {database}.{table}                                        |
| Postgres | JDBC Connector | postgres://{host}:{port} | {database}.{schema}.{table} |
| Oracle | JDBC Connector | oracle://{host}:{port} | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
| Trino | JDBC Connector | trino://{host}:{port} | {catalog}.{schema}.{table} |
| OceanBase | JDBC Connector | oceanbase://{host}:{port} | {database}.{table} |
| DB2 | JDBC Connector | db2://{host}:{port} | {database}.{table} |
| CrateDB | JDBC Connector | cratedb://{host}:{port} | {database}.{table} |

This table is a work in progress. More naming information will be added as lineage integration is completed for each connector.
Binary file added docs/static/fig/lineage_interfaces.png