-
Notifications
You must be signed in to change notification settings - Fork 4.5k
bulk-cdk: complete README #43391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bulk-cdk: complete README #43391
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,107 @@ | ||
# Bulk CDK | ||
|
||
The Bulk CDK is the "new java CDK" that's currently incubating. | ||
It's written in Kotlin and consists of a _core_ and a bunch of _toolkits_: | ||
- The _core_ consists of the Micronaut entry point and other objects which are expected in | ||
connectors built using this CDK. | ||
- The _toolkits_ consist of optional modules which contain objects which are common across | ||
multiple (but by no means all) connectors. | ||
|
||
While the CDK is incubating, its published version numbers are 0.X where X is monotonically | ||
increasing based on the maximum version value found on the maven repository that the jars are | ||
published to: https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/ | ||
|
||
Jar publication happens via a github workflow triggered by pushes to the master branch, i.e. after | ||
merging a pull request. | ||
As the name suggests, its purpose is to help develop connectors which extract or load data in bulk. | ||
The Bulk CDK is written in Kotlin and uses the Micronaut framework for dependency injection. | ||
|
||
## Structure | ||
|
||
The Bulk CDK consists of a _core_ and a bunch of _toolkits_. | ||
|
||
### Core | ||
|
||
The _core_ consists of the Micronaut entry point and other objects which are expected in | ||
connectors built using this CDK. | ||
|
||
The core is broken down into multiple gradle projects; for example the core functionality for | ||
building sources is in `extract`. | ||
|
||
Following up on that example, the expectation for a source connector is that it will use all the | ||
interfaces and implementations in `extract` unless it has a very good reason not to. | ||
There is plenty of value in having all source connectors behave predictably. | ||
|
||
### Toolkits | ||
|
||
The _toolkits_ consist of optional modules which contain objects which are common across | ||
multiple (but by no means all) connectors. | ||
|
||
For example, there's an `extract-jdbc` toolkit to help build source connectors which extract data | ||
using the JDBC API. | ||
The expectation for a toolkit is that it provides naive implementations of core interfaces. | ||
These implementations will be thoroughly tested inside the CDK to serve as a baseline of | ||
functionality; however the connector may (and in fact often should!) replace parts of these. | ||
|
||
Following up on the example of `extract-jdbc`, a source connector needs to implement SQL query | ||
generation interfaces and, for schema discovery, may prefer to query system tables directly | ||
instead of relying on the generic JDBC metadata methods. | ||
|
||
## Dependencies | ||
|
||
The Bulk CDK gradle build relies heavily on so-called [BOM dependencies](https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#bill-of-materials-bom-poms). | ||
This pattern is strongly encouraged to keep transitive version conflicts to a minimum. | ||
postamar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Consider for example the whole Jackson ecosystem. | ||
Using a BOM allows us to add specific Jackson dependencies without having to figure out which | ||
version number to use. | ||
This has some pleasant ripple-effects: | ||
|
||
- When the need comes to bump the version, there's only one version number to bump and that's in | ||
the BOM import. | ||
Consequently, the declared version has a much higher chance of being the effective version | ||
picked by gradle during dependency resolution. | ||
|
||
- The BOM import is re-exported by the `bulk-cdk-core-base` artifact meaning that the rest of the | ||
CDK as well as connectors don't need to worry about Jackson version numbers either. | ||
|
||
It gets better when multiple BOMs are involve. | ||
postamar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Consider for example Micronaut and Jackson: Micronaut also depends on Jackson. | ||
This can (and will!) cause dependency version conflicts; these are much easier to resolve by | ||
reconciling just two BOM versions. | ||
|
||
While BOMs are undoubtedly useful, let's still try to keep external dependencies to a minimum | ||
outside of tests. | ||
Less dependencies, less problems. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there some guidance that we want the lower parts of the CDK to have as few dependencies as possible, but the toolkits can gain deps? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know about that. Toolkits should also stay slim, in my opinion. Where it matters a lot less is tests, but even so, with gradle when you have a lot of dependencies (and therefore, conflicts) it's super easy to declare depending on version X but really because of a transitive dependency you're using version Y instead, without knowing. Typically what then happens is when you bump the declared dep to a later version Z a whole bunch of unrelated stuff breaks. |
||
|
||
## Developing | ||
|
||
Perhaps the most striking difference with the legacy java CDK from a connector DX perspective is | ||
that there are no facilities equivalent to `useLocalCdk = true`. | ||
|
||
This is deliberate and the intention here is to force the testing of CDK functionality to remain | ||
in the CDK. | ||
Recall that this is too often not the case in the legacy java CDK because it's simply not possible | ||
to do so there. | ||
|
||
The Bulk CDK is different. | ||
Dependency injection makes it possible to mock concrete implementation behavior realistically | ||
enough that Bulk CDK tests have entire fake connectors defined inside of them. | ||
postamar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
There's no reason now not to first make changes to the CDK and publish those, and only then make | ||
downstream changes to a connector. | ||
|
||
If there's truly a need to develop both simultaneously, then the way to go may be to: | ||
1. do experimental development in the connector, keeping the CDK- and the connector-specific code | ||
separate; | ||
2. once the CDK-specific code is reasonably mature, hoist it into the Bulk CDK and test it there; | ||
3. finally, publish those changes and have the connector depend on the latest Bulk CDK version. | ||
|
||
## Publishing | ||
|
||
While the CDK is incubating, its published version numbers are 0.X where X is the _build number_. | ||
This build number is monotonically increasing and is based on the maximum version value found on | ||
the [maven repository that the jars are published to](https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/). | ||
|
||
Artifact publication happens via a [github workflow](../../.github/workflows/publish-bulk-cdk.yml) | ||
which gets triggered by any push to the master branch, i.e. after merging a pull request. | ||
|
||
From a contributor's perspective, this means that there's no need to worry about versions or | ||
changelogs. | ||
From a client's perspective, just always use the latest version. | ||
|
||
Once the incubation period winds down and the CDK stabilizes, we can start thinking about contracts, | ||
semantic versioning, and so forth; but not until then. | ||
|
||
## Licensing | ||
|
||
The license for the Bulk CDK is Elastic License 2.0, as specified by the LICENSE file in the root | ||
of this git repository. | ||
Comment on lines
+104
to
+107
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @Hesperide:
This is a departure from our current CDK license, which is MIT There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π These terms have really grown on me!