Discussed in #1093

Originally posted by vishalag001 July 18, 2022
Currently, the dataSource collection contains only a URI, and the name is derived from the URI (everything after the last '/'). Ideally, however, a dataSource should also carry a tableName, schema, and related details.
In Spline, such information is captured on the write operation. For a Hive table write, for example, the params contain the tableName, schema name, etc.; for BigQuery, we get datasetName, projectName, and tableName.
Is it possible to leverage the operation collection to enhance the dataSource collection?
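For illustration, here is a rough sketch of how a friendlier name could be derived from the write-operation params. The param keys used below (`table`, `database`, `dataset`) are assumptions for illustration only; the actual keys depend on the connector:

```scala
// Hypothetical sketch — the param keys are assumptions,
// not the actual connector-specific keys Spline captures.
def displayName(writeParams: Map[String, Any], uri: String): String = {
  val table   = writeParams.get("table").map(_.toString)
  val schema  = writeParams.get("database").map(_.toString)  // e.g. Hive schema name
  val dataset = writeParams.get("dataset").map(_.toString)   // e.g. BigQuery datasetName

  (schema, dataset, table) match {
    case (Some(db), _, Some(t)) => s"$db.$t"  // Hive-style: schema.tableName
    case (_, Some(ds), Some(t)) => s"$ds.$t"  // BigQuery-style: datasetName.tableName
    case (_, _, Some(t))        => t
    case _ => uri.substring(uri.lastIndexOf('/') + 1)  // current behaviour: URI suffix
  }
}
```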
Benefits of this approach:
- The UI could refer to schema.tableName rather than the name (which is derived from the URI), making it more meaningful.
- It would help list the dataSource URIs that fall under the same tableName (i.e. the same table but different partitions).
- On the UI, the list of distinct tables can be displayed, and from there one can navigate to the lineageOverview (via the corresponding progress event). In case of more than 10 partitions, we can use the latest partitions to display the lineage.
@wajda let me know your thoughts. I am happy to contribute to this.
Let's start by creating a piece of code that, for a given data source, finds a better initial name than just a URI suffix. Take a look at ExecutionProducerRepositoryImpl.scala:56.
There you have a parsed execution plan object with all the information, including the write operation and the data source URI. From that you need to create a set of unique DataSource entities that will be stored in the database at the next step. The URI is the ID, so it has to stay the same. Also pay attention to which write-operation properties are deemed optional and which are required. You cannot expect the execution plan to always come from Spark or a Spline agent, so you can only rely on what's defined in the data model or, as a last resort, check the ExecutionPlan agentInfo and systemInfo properties to apply your logic only to execution plans originating from the Spline Spark Agent, and keep the current logic for any other ones.
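A minimal sketch of what that could look like. The case classes below are simplified stand-ins, not the actual Spline model classes, and the "spline" agent-name check is an assumption:

```scala
// Simplified stand-ins for the real Spline model classes (assumption).
case class WriteOperation(outputSource: String, params: Map[String, Any])
case class SystemInfo(name: String, version: String)
case class ExecutionPlan(
    writeOperation: WriteOperation,
    systemInfo: SystemInfo,
    agentInfo: Option[SystemInfo])
case class DataSource(uri: String, name: String) // URI is the ID and must stay unchanged

def createDataSources(plan: ExecutionPlan, readUris: Set[String]): Set[DataSource] = {
  def uriSuffix(uri: String) = uri.substring(uri.lastIndexOf('/') + 1)

  val writeUri = plan.writeOperation.outputSource

  // Only trust write-op params for plans produced by the Spline Spark Agent;
  // other producers may not populate these optional properties.
  val isSplineSparkAgent = plan.agentInfo.exists(_.name.equalsIgnoreCase("spline"))

  val writeName =
    if (isSplineSparkAgent)
      plan.writeOperation.params.get("table").map(_.toString).getOrElse(uriSuffix(writeUri))
    else
      uriSuffix(writeUri) // keep the current logic for any other producer

  // Read sources carry no write params, so they keep the URI-suffix name.
  val readSources = readUris.map(uri => DataSource(uri, uriSuffix(uri)))
  readSources + DataSource(writeUri, writeName)
}
```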