DataSource information enhancement #1095

Open · wajda opened this issue Jul 18, 2022 · Discussed in #1093 · 1 comment

wajda (Contributor) commented Jul 18, 2022

Discussed in #1093

Originally posted by vishalag001 July 18, 2022
Currently, the dataSource collection only contains a URI, and the name is derived from the URI (anything after the last '/'). However, a dataSource should ideally have a tableName, schema, and related details.

In Spline, such information is captured on the write operation. For example, for a Hive table write, we have params which contain the tableName, schema name, etc. For BigQuery, we get datasetName, projectName, and tableName.
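
For illustration, the write-operation params might look roughly like this (a hypothetical sketch; the actual key names and nesting vary by data source and agent version):

```scala
// Hypothetical shape of write-operation params captured by a Spline agent.
// All key names here are illustrative assumptions, not the agent's actual schema.
val hiveWriteParams: Map[String, Any] = Map(
  "table" -> Map(
    "identifier" -> Map(
      "table"    -> "sales",     // table name
      "database" -> "analytics"  // schema/database name
    )
  )
)

val bigQueryWriteParams: Map[String, String] = Map(
  "projectName" -> "my-gcp-project",
  "datasetName" -> "analytics",
  "tableName"   -> "sales"
)
```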

Is it possible to leverage the operation collection to enhance the dataSource collection?

Benefits of this approach:

  • The UI could refer to schema.tableName rather than the name derived from the URI, which would be more meaningful.
  • It would help to list dataSource URIs that fall under the same tableName, i.e. the same table but different partitions (see the sketch after this list).
  • On the UI, the list of different tables could be displayed, and from there one could navigate to the lineageOverview (via the corresponding progress event). In case of more than 10 partitions, we could use the latest partitions to display the lineage.
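
To make the partition-grouping point concrete, here is a hypothetical example (the paths are invented for illustration):

```scala
// Several partition URIs that all belong to the same logical table.
// With the proposed enhancement, all three would be listed under "analytics.sales"
// instead of appearing as unrelated data sources named after the URI suffix.
val partitionUris = Seq(
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-16",
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-17",
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-18"
)
```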

@wajda, let me know your thoughts. I am happy to contribute to this.

@wajda wajda added this to Spline Jul 18, 2022
@wajda wajda moved this to New in Spline Jul 18, 2022
@wajda wajda added the feature label Jul 18, 2022
wajda (Contributor, Author) commented Jul 18, 2022

> I am happy to contribute to this.

@vishalag001

Let's start with creating a piece of code that, for a given data source, finds a better initial name than just a URI suffix. Take a look at ExecutionProducerRepositoryImpl.scala:56.

There you have a parsed execution plan object with all the information, including the write operation and the data source URI. From that, you need to create a set of unique DataSource entities that will be stored in the database at the next step. The URI is the ID, so it has to stay the same. Also pay attention to which write-operation properties are deemed optional and which are required. You cannot expect the execution plan to always come from Spark or a Spline agent, so you can only rely on what's defined in the data model or, as a last resort, check the ExecutionPlan agentInfo and systemInfo properties to apply your logic only to execution plans originating from a Spline Spark Agent, and keep the current logic for any other ones.
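
As a starting point, a minimal sketch of that name derivation could look like the following. This is not Spline's actual code: the param keys and the DataSource shape are assumptions for illustration; the real model classes should be checked in the producer data model.

```scala
// Hypothetical sketch: derive a friendlier DataSource name from write-operation
// params, falling back to the current URI-suffix behavior. The param keys
// ("schemaName", "tableName") are assumptions, not Spline's actual schema.
case class DataSource(uri: String, name: String)

def deriveDataSourceName(uri: String, writeParams: Map[String, Any]): String = {
  val maybeTable  = writeParams.get("tableName").collect { case s: String => s }
  val maybeSchema = writeParams.get("schemaName").collect { case s: String => s }

  (maybeSchema, maybeTable) match {
    case (Some(schema), Some(table)) => s"$schema.$table" // e.g. "analytics.sales"
    case (None, Some(table))         => table
    case _ =>
      // Fallback: the current behavior, i.e. the URI suffix after the last '/'.
      uri.substring(uri.lastIndexOf('/') + 1)
  }
}

// Build the set of unique DataSource entities. The URI is the ID, so it stays
// unchanged; only the display name is enhanced when the params allow it.
def toDataSources(urisWithParams: Seq[(String, Map[String, Any])]): Set[DataSource] =
  urisWithParams
    .groupBy { case (uri, _) => uri }
    .map { case (uri, group) => DataSource(uri, deriveDataSourceName(uri, group.head._2)) }
    .toSet
```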
