From 6006a4f234498dab7d49d2913d057c427db9a2a4 Mon Sep 17 00:00:00 2001 From: Marius Posta Date: Thu, 8 Aug 2024 13:54:25 -0700 Subject: [PATCH] bulk-cdk: complete README (#43391) Co-authored-by: Evan Tahler --- airbyte-cdk/bulk/README.md | 116 +++++++++++++++++++++++++++++++++---- 1 file changed, 104 insertions(+), 12 deletions(-) diff --git a/airbyte-cdk/bulk/README.md b/airbyte-cdk/bulk/README.md index 88f842d8887e..a267bfd11434 100644 --- a/airbyte-cdk/bulk/README.md +++ b/airbyte-cdk/bulk/README.md @@ -1,15 +1,107 @@ # Bulk CDK The Bulk CDK is the "new java CDK" that's currently incubating. -It's written in Kotlin and consists of a _core_ and a bunch of _toolkits_: -- The _core_ consists of the Micronaut entry point and other objects which are expected in - connectors built using this CDK. -- The _toolkits_ consist of optional modules which contain objects which are common across - multiple (but by no means all) connectors. - -While the CDK is incubating, its published version numbers are 0.X where X is monotonically -increasing based on the maximum version value found on the maven repository that the jars are -published to: https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/ - -Jar publication happens via a github workflow triggered by pushes to the master branch, i.e. after -merging a pull request. +As the name suggests, its purpose is to help develop connectors which extract or load data in bulk. +The Bulk CDK is written in Kotlin and uses the Micronaut framework for dependency injection. + +## Structure + +The Bulk CDK consists of a _core_ and a bunch of _toolkits_. + +### Core + +The _core_ consists of the Micronaut entry point and other objects which are expected in +connectors built using this CDK. + +The core is broken down into multiple gradle projects; for example the core functionality for +building sources is in `extract`. + +Following up on that example, the expectation for a source connector is that it will use all the +interfaces and implementations in `extract` unless it has a very good reason not to. +There is plenty of value in having all source connectors behave predictably. + +### Toolkits + +The _toolkits_ consist of optional modules which contain objects which are common across +multiple (but by no means all) connectors. + +For example, there's an `extract-jdbc` toolkit to help build source connectors which extract data +using the JDBC API. +The expectation for a toolkit is that it provides naive implementations of core interfaces. +These implementations will be thoroughly tested inside the CDK to serve as a baseline of +functionality; however the connector may (and in fact often should!) replace parts of these. + +Following up on the example of `extract-jdbc`, a source connector needs to implement SQL query +generation interfaces and, for schema discovery, may prefer to query system tables directly +instead of relying on the generic JDBC metadata methods. + +## Dependencies + +The Bulk CDK gradle build relies heavily on so-called [BOM dependencies](https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#bill-of-materials-bom-poms). +This pattern is strongly encouraged to keep transitive version conflicts to a minimum. This is beneficial for many reasons, including reproducible builds and a good security posture. + +Consider for example the whole Jackson ecosystem. +Using a BOM allows us to add specific Jackson dependencies without having to figure out which +version number to use. +This has some pleasant ripple-effects: + +- When the need comes to bump the version, there's only one version number to bump and that's in + the BOM import. + Consequently, the declared version has a much higher chance of being the effective version + picked by gradle during dependency resolution. + +- The BOM import is re-exported by the `bulk-cdk-core-base` artifact meaning that the rest of the + CDK as well as connectors don't need to worry about Jackson version numbers either. + +It gets better when multiple BOMs are involved. +Consider for example Micronaut and Jackson: Micronaut also depends on Jackson. +This can (and will!) cause dependency version conflicts; these are much easier to resolve by +reconciling just two BOM versions. + +While BOMs are undoubtedly useful, let's still try to keep external dependencies to a minimum +outside of tests. +Less dependencies, less problems. + +## Developing + +Perhaps the most striking difference with the legacy java CDK from a connector DX perspective is +that there are no facilities equivalent to `useLocalCdk = true`. + +This is deliberate and the intention here is to force the testing of CDK functionality to remain +in the CDK. +Recall that this is too often not the case in the legacy java CDK because it's simply not possible +to do so there. + +The Bulk CDK is different. +Dependency injection makes it possible to mock concrete implementation behavior realistically +enough that Bulk CDK tests have entire fake connectors defined inside of them. + +There's no reason now not to first make changes to the CDK and publish those, and only then make +downstream changes to a connector. + +If there's truly a need to develop both simultaneously, then the way to go may be to: +1. do experimental development in the connector, keeping the CDK- and the connector-specific code + separate; +2. once the CDK-specific code is reasonably mature, hoist it into the Bulk CDK and test it there; +3. finally, publish those changes and have the connector depend on the latest Bulk CDK version. + +## Publishing + +While the CDK is incubating, its published version numbers are 0.X where X is the _build number_. +This build number is monotonically increasing and is based on the maximum version value found on +the [maven repository that the jars are published to](https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/). + +Artifact publication happens via a [github workflow](../../.github/workflows/publish-bulk-cdk.yml) +which gets triggered by any push to the master branch, i.e. after merging a pull request. + +From a contributor's perspective, this means that there's no need to worry about versions or +changelogs. +From a client's perspective, just always use the latest version. + +Once the incubation period winds down and the CDK stabilizes, we can start thinking about contracts, +semantic versioning, and so forth; but not until then. + +## Licensing + +The license for the Bulk CDK is Elastic License 2.0, as specified by the LICENSE file in the root +of this git repository.