DataSource information enhancement #1095

Open · wajda opened this issue Jul 18, 2022 · Discussed in #1093 · 1 comment

wajda (Contributor) commented Jul 18, 2022

Discussed in #1093

Originally posted by vishalag001 July 18, 2022
Currently, the dataSource collection only contains a URI, and the name is derived from the URI (anything after the last '/'). However, a dataSource should ideally have a tableName, schema, and related details.

In Spline, such information is captured on the write operation. For example, for a Hive table write, we have params which contain the tableName, schema name, etc. For BigQuery, we get datasetName, projectName, and tableName.
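
For illustration, the write-operation params might look roughly like this (a hypothetical sketch; the actual key names and nesting vary by data source and agent version):

```scala
// Hypothetical shape of write-operation params captured by a Spline agent.
// All key names here are illustrative assumptions, not the agent's actual schema.
val hiveWriteParams: Map[String, Any] = Map(
  "table" -> Map(
    "identifier" -> Map(
      "table"    -> "sales",     // table name
      "database" -> "analytics"  // schema/database name
    )
  )
)

val bigQueryWriteParams: Map[String, String] = Map(
  "projectName" -> "my-gcp-project",
  "datasetName" -> "analytics",
  "tableName"   -> "sales"
)
```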

Is it possible to leverage the operation collection to enhance the dataSource collection?

Benefits of this approach:

  • The UI could refer to schema.tableName rather than the name derived from the URI, which would be more meaningful.
  • It would help to list dataSource URIs that fall under the same tableName, i.e. the same table but different partitions (see the sketch after this list).
  • On the UI, the list of different tables could be displayed, and from there one could navigate to the lineageOverview (via the corresponding progress event). In case of more than 10 partitions, we could use the latest partitions to display the lineage.
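
To make the partition-grouping point concrete, here is a hypothetical example (the paths are invented for illustration):

```scala
// Several partition URIs that all belong to the same logical table.
// With the proposed enhancement, all three would be listed under "analytics.sales"
// instead of appearing as unrelated data sources named after the URI suffix.
val partitionUris = Seq(
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-16",
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-17",
  "hdfs://namenode/warehouse/analytics.db/sales/date=2022-07-18"
)
```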

@wajda, let me know your thoughts. I am happy to contribute to this.

@wajda wajda added this to Spline Jul 18, 2022
@wajda wajda moved this to New in Spline Jul 18, 2022
@wajda wajda added the feature label Jul 18, 2022
wajda (Contributor, Author) commented Jul 18, 2022

> I am happy to contribute to this.

@vishalag001

Let's start with creating a piece of code that, for a given data source, finds a better initial name than just a URI suffix. Take a look at ExecutionProducerRepositoryImpl.scala:56.

There you have a parsed execution plan object with all the information, including the write operation and the data source URI. From that, you need to create a set of unique DataSource entities that will be stored in the database at the next step. The URI is the ID, so it has to stay the same. Also pay attention to which write-operation properties are deemed optional and which are required. You cannot expect the execution plan to always come from Spark or a Spline agent, so you can only rely on what's defined in the data model or, as a last resort, check the ExecutionPlan agentInfo and systemInfo properties to apply your logic only to execution plans originating from a Spline Spark Agent, and keep the current logic for any other ones.
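
As a starting point, a minimal sketch of that name derivation could look like the following. This is not Spline's actual code: the param keys and the DataSource shape are assumptions for illustration; the real model classes should be checked in the producer data model.

```scala
// Hypothetical sketch: derive a friendlier DataSource name from write-operation
// params, falling back to the current URI-suffix behavior. The param keys
// ("schemaName", "tableName") are assumptions, not Spline's actual schema.
case class DataSource(uri: String, name: String)

def deriveDataSourceName(uri: String, writeParams: Map[String, Any]): String = {
  val maybeTable  = writeParams.get("tableName").collect { case s: String => s }
  val maybeSchema = writeParams.get("schemaName").collect { case s: String => s }

  (maybeSchema, maybeTable) match {
    case (Some(schema), Some(table)) => s"$schema.$table" // e.g. "analytics.sales"
    case (None, Some(table))         => table
    case _ =>
      // Fallback: the current behavior, i.e. the URI suffix after the last '/'.
      uri.substring(uri.lastIndexOf('/') + 1)
  }
}

// Build the set of unique DataSource entities. The URI is the ID, so it stays
// unchanged; only the display name is enhanced when the params allow it.
def toDataSources(urisWithParams: Seq[(String, Map[String, Any])]): Set[DataSource] =
  urisWithParams
    .groupBy { case (uri, _) => uri }
    .map { case (uri, group) => DataSource(uri, deriveDataSourceName(uri, group.head._2)) }
    .toSet
```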
