Commit 6006a4f

Marius Posta and evantahler authored
bulk-cdk: complete README (#43391)
Co-authored-by: Evan Tahler <[email protected]>
1 parent a16cc58 commit 6006a4f

1 file changed: +104 -12 lines changed

airbyte-cdk/bulk/README.md

@@ -1,15 +1,107 @@
 # Bulk CDK

 The Bulk CDK is the "new java CDK" that's currently incubating.
-It's written in Kotlin and consists of a _core_ and a bunch of _toolkits_:
-- The _core_ consists of the Micronaut entry point and other objects which are expected in
-connectors built using this CDK.
-- The _toolkits_ consist of optional modules which contain objects which are common across
-multiple (but by no means all) connectors.
-
-While the CDK is incubating, its published version numbers are 0.X where X is monotonically
-increasing based on the maximum version value found on the maven repository that the jars are
-published to: https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/
-
-Jar publication happens via a github workflow triggered by pushes to the master branch, i.e. after
-merging a pull request.
+As the name suggests, its purpose is to help develop connectors which extract or load data in bulk.
+The Bulk CDK is written in Kotlin and uses the Micronaut framework for dependency injection.
+
+## Structure
+
+The Bulk CDK consists of a _core_ and a bunch of _toolkits_.
+
+### Core
+
+The _core_ consists of the Micronaut entry point and other objects which are expected in
+connectors built using this CDK.
+
+The core is broken down into multiple gradle projects; for example the core functionality for
+building sources is in `extract`.
+
+Following up on that example, the expectation for a source connector is that it will use all the
+interfaces and implementations in `extract` unless it has a very good reason not to.
+There is plenty of value in having all source connectors behave predictably.
+
+### Toolkits
+
+The _toolkits_ consist of optional modules which contain objects which are common across
+multiple (but by no means all) connectors.
+
+For example, there's an `extract-jdbc` toolkit to help build source connectors which extract data
+using the JDBC API.
+The expectation for a toolkit is that it provides naive implementations of core interfaces.
+These implementations will be thoroughly tested inside the CDK to serve as a baseline of
+functionality; however the connector may (and in fact often should!) replace parts of these.
+
+Following up on the example of `extract-jdbc`, a source connector needs to implement SQL query
+generation interfaces and, for schema discovery, may prefer to query system tables directly
+instead of relying on the generic JDBC metadata methods.
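
The toolkit pattern described above (a naive baseline implementation of a core interface, which a connector may replace) can be sketched roughly as follows. This is only an illustration: `SchemaDiscovery`, `GenericSchemaDiscovery`, `SystemTableSchemaDiscovery` and `describe` are hypothetical names, not actual Bulk CDK types, and in-memory maps stand in for real JDBC metadata and system tables.

```kotlin
// Hypothetical names throughout; none of these are actual Bulk CDK types.

/** Stand-in for a core interface that a toolkit implements naively. */
interface SchemaDiscovery {
    fun columnNames(table: String): List<String>
}

/** Toolkit-style baseline: generic, returns columns exactly as the metadata reports them. */
class GenericSchemaDiscovery(
    private val metadata: Map<String, List<String>>,
) : SchemaDiscovery {
    override fun columnNames(table: String): List<String> =
        metadata[table].orEmpty()
}

/** Connector-specific replacement; here it merely sorts, to show that behavior can differ. */
class SystemTableSchemaDiscovery(
    private val systemTable: Map<String, List<String>>,
) : SchemaDiscovery {
    override fun columnNames(table: String): List<String> =
        systemTable[table].orEmpty().sorted()
}

/** The rest of the connector depends only on the interface. */
fun describe(discovery: SchemaDiscovery, table: String): String =
    "$table(${discovery.columnNames(table).joinToString(", ")})"

fun main() {
    val data = mapOf("users" to listOf("name", "id"))
    println(describe(GenericSchemaDiscovery(data), "users"))     // users(name, id)
    println(describe(SystemTableSchemaDiscovery(data), "users")) // users(id, name)
}
```

Because callers only see the interface, swapping the baseline for the connector-specific implementation requires no change elsewhere.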
+
+## Dependencies
+
+The Bulk CDK gradle build relies heavily on so-called [BOM dependencies](https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#bill-of-materials-bom-poms).
+This pattern is strongly encouraged to keep transitive version conflicts to a minimum. This is beneficial for many reasons, including reproducible builds and a good security posture.
+
+Consider for example the whole Jackson ecosystem.
+Using a BOM allows us to add specific Jackson dependencies without having to figure out which
+version number to use.
+This has some pleasant ripple-effects:
+
+- When the need comes to bump the version, there's only one version number to bump and that's in
+the BOM import.
+Consequently, the declared version has a much higher chance of being the effective version
+picked by gradle during dependency resolution.
+
+- The BOM import is re-exported by the `bulk-cdk-core-base` artifact meaning that the rest of the
+CDK as well as connectors don't need to worry about Jackson version numbers either.
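
As a rough illustration of the BOM pattern described above, a Gradle Kotlin DSL build script might import the Jackson BOM once and then declare individual Jackson modules without versions. This is a sketch, not the Bulk CDK's actual build file: the version number is a placeholder and the choice of modules is an assumption.

```kotlin
// build.gradle.kts (sketch; the BOM version is a placeholder, not the one the Bulk CDK uses)
plugins {
    `java-library`
}

repositories {
    mavenCentral()
}

dependencies {
    // Import the Jackson BOM once; `api` re-exports it to consumers of this artifact.
    api(platform("com.fasterxml.jackson:jackson-bom:2.17.0"))

    // Individual modules are declared without a version; the BOM supplies it.
    api("com.fasterxml.jackson.core:jackson-databind")
    implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
}
```

With this layout, bumping Jackson means changing only the version on the `platform(...)` line.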
+
+It gets better when multiple BOMs are involved.
+Consider for example Micronaut and Jackson: Micronaut also depends on Jackson.
+This can (and will!) cause dependency version conflicts; these are much easier to resolve by
+reconciling just two BOM versions.
+
+While BOMs are undoubtedly useful, let's still try to keep external dependencies to a minimum
+outside of tests.
+Less dependencies, less problems.
+
+## Developing
+
+Perhaps the most striking difference with the legacy java CDK from a connector DX perspective is
+that there are no facilities equivalent to `useLocalCdk = true`.
+
+This is deliberate and the intention here is to force the testing of CDK functionality to remain
+in the CDK.
+Recall that this is too often not the case in the legacy java CDK because it's simply not possible
+to do so there.
+
+The Bulk CDK is different.
+Dependency injection makes it possible to mock concrete implementation behavior realistically
+enough that Bulk CDK tests have entire fake connectors defined inside of them.
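
The idea of a fake connector defined inside a test can be sketched in plain Kotlin. This is a simplified illustration under stated assumptions: constructor injection stands in for Micronaut's dependency injection, and `RecordSource`, `StreamReader` and `FakeSource` are hypothetical names rather than Bulk CDK types.

```kotlin
// Hypothetical names; plain constructor injection stands in for Micronaut here.

/** What the CDK-side logic needs from a concrete connector. */
interface RecordSource {
    fun read(stream: String): List<String>
}

/** CDK-side logic under test: it only ever sees the interface. */
class StreamReader(private val source: RecordSource) {
    fun countRecords(streams: List<String>): Map<String, Int> =
        streams.associateWith { source.read(it).size }
}

/** A fake connector defined entirely inside the test code. */
class FakeSource(private val data: Map<String, List<String>>) : RecordSource {
    override fun read(stream: String): List<String> = data[stream].orEmpty()
}

fun main() {
    val fake = FakeSource(mapOf("users" to listOf("alice", "bob"), "orders" to emptyList()))
    val counts = StreamReader(fake).countRecords(listOf("users", "orders", "missing"))
    println(counts) // {users=2, orders=0, missing=0}
}
```

Since `StreamReader` receives its `RecordSource` from outside, the test can exercise it without any real database or connector.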
+
+There's no reason now not to first make changes to the CDK and publish those, and only then make
+downstream changes to a connector.
+
+If there's truly a need to develop both simultaneously, then the way to go may be to:
+1. do experimental development in the connector, keeping the CDK- and the connector-specific code
+separate;
+2. once the CDK-specific code is reasonably mature, hoist it into the Bulk CDK and test it there;
+3. finally, publish those changes and have the connector depend on the latest Bulk CDK version.
+
+## Publishing
+
+While the CDK is incubating, its published version numbers are 0.X where X is the _build number_.
+This build number is monotonically increasing and is based on the maximum version value found on
+the [maven repository that the jars are published to](https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/bulk-cdk/).
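
One way to picture the versioning scheme above is a small function that derives the next `0.X` version from the versions already published. This is a sketch only: `nextVersion` is a hypothetical helper, and the rule "maximum published build number plus one" is an assumption, since the text only says the build number is based on the maximum found in the repository.

```kotlin
/**
 * Computes the next "0.X" version from the versions already published.
 * Assumption: the next build number is the current maximum plus one.
 * Entries that are not of the form "0.<integer>" are ignored.
 */
fun nextVersion(published: List<String>): String {
    val maxBuild = published
        .filter { it.startsWith("0.") }
        .mapNotNull { it.removePrefix("0.").toIntOrNull() }
        .maxOrNull() ?: 0
    return "0.${maxBuild + 1}"
}

fun main() {
    // Build numbers compare numerically, so 0.10 is newer than 0.9.
    println(nextVersion(listOf("0.1", "0.9", "0.10"))) // 0.11
    println(nextVersion(emptyList()))                  // 0.1
}
```

Comparing build numbers as integers rather than strings matters here, because a lexicographic comparison would rank `0.9` above `0.10`.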
+
+Artifact publication happens via a [github workflow](../../.github/workflows/publish-bulk-cdk.yml)
+which gets triggered by any push to the master branch, i.e. after merging a pull request.
+
+From a contributor's perspective, this means that there's no need to worry about versions or
+changelogs.
+From a client's perspective, just always use the latest version.
+
+Once the incubation period winds down and the CDK stabilizes, we can start thinking about contracts,
+semantic versioning, and so forth; but not until then.
+
+## Licensing
+
+The license for the Bulk CDK is Elastic License 2.0, as specified by the LICENSE file in the root
+of this git repository.
