-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[590] Add RunCatalogSync utility for synchronizing tables across catalogs #591
base: main
Are you sure you want to change the base?
Conversation
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSyncClient.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/exception/CatalogRefreshException.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/sync/SyncResult.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogUtils.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogUtils.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/model/catalog/CatalogType.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
7f307d4
to
06501c1
Compare
I'm pushing another PR for the client side changes for CatalogSync. |
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalog.java
Outdated
Show resolved
Hide resolved
After putting more thought into it, I think we can keep the cross catalog sync as a separate function and not integrate it with TableFormatSync.
This will be separate utility option in RunSync to configure a catalogConfig.yaml file. |
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalTable.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalTable.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/TargetCatalog.java
Outdated
Show resolved
Hide resolved
/** This class represents the unique identifier for a table in a catalog. */ | ||
@Value | ||
@Builder | ||
public class CatalogTableIdentifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What do you think of calling this class
TableIdentifier
. - Also, we should consider making the naming more generic, since it should represent all types of table identifiers. For instance, while
databaseName
is a popular namespace, there's alsoschema
. In some scenarios,databaseName
is synonymous withcatalogName
. The current two-part naming based ontable
anddatabaseName
seems a bit restrictive to me.
Would you mind sharing the use cases this naming caters to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is okay, included the prefix Catalog because TableIdentifier
is overloaded in the dependent projects (iceberg, hudi, delta etc.) and didn't want to add another identifier with the same name.
For 2, agreed that each catalog or system has a different term for what's called a "logical grouping of tables".
I have looked up this name in different systems and have found databaseName, namespace, schema are the popular ones, a more generic name that comes to my mind is tableCollection
or tableGroup
? If we can't find a generic name, choosing databaseName maybe okay ? Regardless of the name we choose the conversion interfaces for each catalog are responsible for translating the catalog table definition to this representation.
xtable-api/src/main/java/org/apache/xtable/model/catalog/CatalogTableIdentifier.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/extractor/ConversionSource.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSyncClient.java
Outdated
Show resolved
Hide resolved
xtable-utilities/src/test/java/org/apache/xtable/utilities/TestRunCatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/spi/sync/CatalogSyncClient.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Outdated
Show resolved
Hide resolved
xtable-api/src/test/java/org/apache/xtable/spi/sync/TestCatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Show resolved
Hide resolved
Hi @vinishjail97 , this PR seems to cover two distinct features: refactoring to add |
@@ -34,14 +35,21 @@ public class ConversionConfig { | |||
@NonNull SourceTable sourceTable; | |||
// One or more targets to sync the table metadata to | |||
List<TargetTable> targetTables; | |||
// Each target table can be synced to multiple target catalogs, this is map from | |||
// targetTableIdentifier to target catalogs. | |||
Map<String, List<TargetCatalogConfig>> targetCatalogs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious about the expected behavior if XTable
fails to update a subset of the TargetCatalogs
and if that impacts the way incremental sync works?
This may have been covered elsewhere in the PR that I might have missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't change the behavior of existing incremental sync, the failure is returned back as part of this, btw existing XTable users can sync use incremental sync without configuring source/target catalogs, the existing RunSync class in utilities or sync function in ConversionController
is not changing.
// The sync status for each catalog.
List<CatalogSyncStatus> catalogSyncStatusList;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xtable-api/src/main/java/org/apache/xtable/conversion/TargetTable.java
Outdated
Show resolved
Hide resolved
public class CatalogTableIdentifier { | ||
/** | ||
* Catalogs have the ability to group tables logically, databaseName is the identifier for such | ||
* logical classification. The alternate names for this field include namespace, schemaName etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the nesting vary depending on the database. However, the usage of a 3-part naming is quite common. The convention includes name of database, schema, and table or view. I am wondering if it would be feasible to keep the table Id as a string instead of an object whose structure can vary a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the three level naming convention, kept catalogName as optional.
c227b76
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an internal representation for the table identifier in a catalog, for catalogs following a different convention we will interoperate using this model.
Hi @vinishjail97, This change significantly impacts XTable usage. The current description doesn't provide enough context for me. Could we set up a call or create a document to discuss this change in detail? |
xtable-api/src/main/java/org/apache/xtable/conversion/TargetCatalogConfig.java
Show resolved
Hide resolved
@ashvina I will create two separate PR's to avoid the confusion.
We are not changing the way incremental sync works for table formats, the sync method in |
7ab5659
to
1c07c07
Compare
@ashvina I have split this change two PR's for clarity.
I will push the PR for an RFC doc for this tomorrow but have replied to most of your comments. Regarding the class |
xtable-core/src/test/java/org/apache/xtable/delta/ITDeltaDeleteVectorConvert.java
Outdated
Show resolved
Hide resolved
1c07c07
to
a723014
Compare
115abc5
to
15a54e6
Compare
a723014
to
ced6288
Compare
6e09f5d
to
893b1a6
Compare
893b1a6
to
eefbdd2
Compare
xtable-api/src/main/java/org/apache/xtable/conversion/ExternalCatalogConfig.java
Outdated
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/ConversionConfig.java
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/catalog/ExternalCatalogConfigFactory.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/conversion/ConversionController.java
Show resolved
Hide resolved
xtable-core/src/test/java/org/apache/xtable/TestSparkDeltaTable.java
Outdated
Show resolved
Hide resolved
import org.apache.xtable.utilities.RunCatalogSync.DatasetConfig.TargetTableIdentifier; | ||
|
||
/** | ||
* Provides standalone process for reading tables from a source catalog and synchronizing their |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this will also do table format syncing, do we want to just combine this functionality with the existing tool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have kept RunCatalogSync
because there will be future functionality of synchronizing permissions as well and RunSync
will get overloaded with options.
xtable-utilities/src/test/java/org/apache/xtable/utilities/TestRunCatalogSync.java
Show resolved
Hide resolved
xtable-api/src/main/java/org/apache/xtable/conversion/TargetCatalogConfig.java
Outdated
Show resolved
Hide resolved
xtable-core/src/test/java/org/apache/xtable/conversion/TestConversionController.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/catalog/ExternalCatalogConfigFactory.java
Outdated
Show resolved
Hide resolved
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java
Outdated
Show resolved
Hide resolved
xtable-utilities/src/main/java/org/apache/xtable/utilities/RunCatalogSync.java
Outdated
Show resolved
Hide resolved
try { | ||
conversionController.syncTableAcrossCatalogs( | ||
conversionConfig, | ||
getConversionSourceProviders(tableFormats, tableFormatConverters, hadoopConf)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this a bit confusing when reading through the code since it is using targets to create sources at first glance. Would the SourceProvider
be more clear if it was a TableStateProvider
or something that does not include source/target language we use elsewhere for the table format sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will think of a better name, for now keeping this refactoring in a separate PR.
#613
xtable-core/src/main/java/org/apache/xtable/catalog/ExternalCatalogConfigFactory.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/catalog/ExternalCatalogConfigFactory.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/catalog/CatalogConversionFactory.java
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/catalog/CatalogConversionFactory.java
Outdated
Show resolved
Hide resolved
Looks good to me overall but let's wait to merge until the RFC is finalized. |
Important Read
What is the purpose of the pull request
Introduced
RunCatalogSync
utility which does the following on a high level. This unblocks the ability to sync tables from a source catalog to multiple target catalogs, you can look at the sample configuration herextable-utilities/src/test/resources/catalogConfig.yaml
See RFC for more details on the deign and flow chart -> #605
Brief change log
(for example:)
Verify this pull request
(Please pick either of the following options)
This change added tests and can be verified as follows:
(example:)