
🎉 New Destination: Apache Iceberg #18836


Merged (24 commits, Nov 18, 2022)

Conversation

Leibnizhu
Contributor

@Leibnizhu Leibnizhu commented Nov 2, 2022

What

Add Apache Iceberg Destination.
Closes #2833. Related to #6745.

How

Add Apache Iceberg Destination, implemented with Spark SQL (3.3.0) and iceberg-spark-runtime (1.0.0).
Spark runs in local mode.
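For context, a minimal sketch (not code from this PR) of how a local-mode SparkSession is typically wired to an Iceberg catalog; the catalog name, metastore URI, and warehouse path below are illustrative placeholders:

    import org.apache.spark.sql.SparkSession;

    class LocalSparkExample {
        // Hedged sketch: local-mode Spark session with the Iceberg runtime's SQL
        // extensions and a Hive-backed Iceberg catalog. All values are placeholders.
        static SparkSession build() {
            return SparkSession.builder()
                .master("local")
                .appName("destination-iceberg")
                .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
                .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
                .config("spark.sql.catalog.iceberg.type", "hive")
                .config("spark.sql.catalog.iceberg.uri", "thrift://localhost:9083")
                .config("spark.sql.catalog.iceberg.warehouse", "s3a://my-bucket/warehouse")
                .getOrCreate();
        }
    }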

Recommended reading order

  1. spec.json
  2. IcebergDestination.java
  3. IcebergConsumer.java

🚨 User Impact 🚨

No breaking changes; this PR only adds a new destination.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-paste of unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow the instructions in the README. For Java connectors, run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Tests

Unit

[screenshot: unit test results]

see https://gist.github.com/Leibnizhu/c3eaad1b04ad8a8becb77cda28b47313

Integration

execute:

./gradlew :airbyte-integrations:connectors:destination-iceberg:integrationTestJava --tests "*IntegrationTest"

[screenshot: integration test results]

see https://gist.github.com/Leibnizhu/33977e02c191e8d0da68c3a1af2bce57

Acceptance

execute:

# write config in `secrets/config.json`
./gradlew :airbyte-integrations:connectors:destination-iceberg:integrationTestJava --tests "*AcceptanceTest"

see https://gist.github.com/Leibnizhu/c6c76cc9f42856795bb4895ee434849c

@CLAassistant

CLAassistant commented Nov 2, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Nov 2, 2022
@Leibnizhu Leibnizhu changed the title Add Apache Iceberg Destination 🎉 New Destination: Apache Iceberg Nov 2, 2022
@Leibnizhu Leibnizhu marked this pull request as ready for review November 2, 2022 06:55
@Leibnizhu Leibnizhu marked this pull request as draft November 2, 2022 08:29
@Leibnizhu Leibnizhu marked this pull request as ready for review November 2, 2022 09:35
@marcosmarxm
Member

Thanks for the contribution @Leibnizhu, someone from the team will review it during the week.

@marcosmarxm marcosmarxm requested a review from itaseskii November 2, 2022 22:17
@sajarin sajarin added the bounty-XL Maintainer program: claimable extra large bounty PR label Nov 7, 2022
@marcosmarxm
Member

@Leibnizhu do you need assistance or review on this work?

@Leibnizhu
Contributor Author

@Leibnizhu do you need assistance or review on this work?

Yep, please give some advice.

if (!namespaceSet.contains(namespace)) {
    namespaceSet.add(namespace);
    try {
        spark.sql("CREATE DATABASE IF NOT EXISTS " + namespace);
Contributor

When it comes to SQL databases, most of the time the namespace is used to separate the data into logical schemas. Is it really needed to create a new database per namespace?

Contributor Author

For Iceberg, a table must belong to a database. Iceberg's databases are for physical isolation, and their metadata is stored in the catalog (e.g. HiveCatalog, backed by a Hive metastore server).
In this PR, we assume that one namespace maps to one Iceberg database.
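As a hedged illustration of that mapping (the class and method names below are hypothetical, not from this PR), each stream is addressed by an Iceberg TableIdentifier whose first part is the namespace/database:

    import org.apache.iceberg.catalog.TableIdentifier;

    class NamespaceMapping {
        // Hedged sketch: one Airbyte namespace -> one Iceberg database,
        // one stream -> one table inside that database.
        static TableIdentifier toIcebergTable(String namespace, String streamName) {
            return TableIdentifier.of(namespace, streamName);
        }
    }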

writeConfig.isAppendMode() ? "append" : "overwrite",
tempTableName,
finalTableName);
spark.sql("SELECT * FROM %s".formatted(tempTableName))
Contributor

Can you point me to where the temp and final tables are created? Is this done implicitly?

Contributor Author

@Leibnizhu Leibnizhu Nov 15, 2022

They are created implicitly by Spark SQL (saveAsTable).
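For illustration, a hedged sketch of how saveAsTable creates a table on first write (the buffer and table names are assumptions, not code from this PR):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    class TempTableWriter {
        // Hedged sketch: saveAsTable creates the target table if it does not exist,
        // so no explicit CREATE TABLE is issued for the temp table. Whether it ends up
        // as an Iceberg table depends on the catalog configured on the SparkSession.
        static void flush(Dataset<Row> buffer, String tempTableName) {
            buffer.write()
                  .mode(SaveMode.Append)
                  .saveAsTable(tempTableName);
        }
    }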

Contributor

IMO, temp tables should be created in the startTracked() method, which is used for executing all the preparation steps before starting to receive records. It is better to fail fast if you can't create something than to find out midway through replicating data.
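A hedged sketch of that suggestion, assuming a consumer that overrides Airbyte's startTracked() hook (WriteConfig and its getters are hypothetical names used only for illustration):

    // Hedged sketch: create every namespace and temp table up front so failures
    // surface before any records are consumed. writeConfigs and spark are fields
    // of the assumed consumer class.
    @Override
    protected void startTracked() throws Exception {
        for (WriteConfig writeConfig : writeConfigs.values()) {
            spark.sql("CREATE DATABASE IF NOT EXISTS " + writeConfig.getNamespace());
            spark.sql("CREATE TABLE IF NOT EXISTS " + writeConfig.getFullTempTableName()
                + " (data STRING) USING iceberg");
        }
    }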

public AirbyteConnectionStatus check(JsonNode config) {
    try {
        IcebergCatalogConfig icebergCatalogConfig = icebergCatalogConfigFactory.fromJsonNodeConfig(config);
        icebergCatalogConfig.check();
Contributor

Not a big fan of performing business logic in a config class. IMO, config classes should only contain methods for the initialization of objects. The cleanup logic is also error-prone: one example is if you manage to create a table with

Table tempTable = catalog.createTable(tempTableId, schema);

but somehow fail on

    try (CloseableIterable<Record> records = IcebergGenerics.read(tempTable).build()) {
        for (Record record : records) {
            // never reach
            log.info("Record in temp table: {}", record);
        }
    }

(by the way, why do you read records instead of writing?) then your cleanup logic with

boolean dropSuccess = catalog.dropTable(tempTableId);

won't be executed, and for that reason such logic should be part of a finally {} block.
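A hedged sketch of the suggested shape, using the Iceberg catalog API with the cleanup in a finally block (the wrapper class and method are illustrative, not the PR's actual code):

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    class ConnectionCheck {
        // Hedged sketch: create a throwaway table, exercise a write (and optionally a
        // read), and always drop the table in finally so a partial failure never leaks it.
        static void check(Catalog catalog, TableIdentifier tempTableId, Schema schema) throws Exception {
            Table tempTable = catalog.createTable(tempTableId, schema);
            try {
                // ... write a test record to tempTable and read it back here ...
            } finally {
                catalog.dropTable(tempTableId);
            }
        }
    }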

@itaseskii
Contributor

itaseskii commented Nov 17, 2022

/test connector=connectors/destination-iceberg

🕑 connectors/destination-iceberg https://github.com/airbytehq/airbyte/actions/runs/3492896620
❌ connectors/destination-iceberg https://github.com/airbytehq/airbyte/actions/runs/3492896620
🐛 https://gradle.com/s/zhnfxj65phxoi

Build Failed

Test summary info:

Could not find result summary

@itaseskii
Contributor

itaseskii commented Nov 17, 2022

Hey @Leibnizhu what type of setup/config do we need to get the tests in IcebergDestinationAcceptanceTest passing? I see that you have different acceptance tests for different configurations so I was wondering what is the point of this test class? :)

@Leibnizhu
Contributor Author


Hey @Leibnizhu what type of setup/config do we need to get the tests in IcebergDestinationAcceptanceTest passing? I see that you have different acceptance tests for different configurations so I was wondering what is the point of this test class? :)

IcebergDestinationAcceptanceTest is written for acceptance tests against an external Iceberg catalog, for local development. Examples for config.json are in the secrets-examples dir.

IcebergDestinationAcceptanceTest needs an Iceberg catalog service (a Hive metastore, a JDBC database, or HDFS) and a storage service (S3 protocol, for example Amazon S3 or MinIO).

I have already written tests with Testcontainers in the packages io.airbyte.integrations.destination.iceberg.hadoop, io.airbyte.integrations.destination.iceberg.hive, and io.airbyte.integrations.destination.iceberg.jdbc, so IcebergDestinationAcceptanceTest can be removed.

@itaseskii
Contributor

…so IcebergDestinationAcceptanceTest can be removed.

@Leibnizhu okay then please remove it so we can have a passing test run in the CI

@marcosmarxm
Member

marcosmarxm commented Nov 18, 2022

/test connector=connectors/destination-iceberg

🕑 connectors/destination-iceberg https://github.com/airbytehq/airbyte/actions/runs/3498603268
✅ connectors/destination-iceberg https://github.com/airbytehq/airbyte/actions/runs/3498603268
No Python unittests run

Build Passed

Test summary info:

All Passed

@Data
public class FormatConfig {

    public static final int DEFAULT_FLUSH_BATCH_SIZE = 10000;
Contributor

Buffering records based on count alone is a bit risky, since 10_000 records can quickly exhaust your memory depending on their size. It is better if the actual storage size is taken into consideration as well.
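A hedged sketch of a count-and-size flush check (names and limits are illustrative, not part of this PR):

    import java.util.ArrayList;
    import java.util.List;

    class RecordBuffer {
        // Hedged sketch: flush when either the record count or the estimated
        // buffered byte size crosses its threshold. Limits are placeholders.
        private static final int MAX_RECORDS = 10_000;
        private static final long MAX_BUFFER_BYTES = 64L * 1024 * 1024; // ~64 MiB

        private final List<String> rows = new ArrayList<>();
        private long bufferedBytes = 0;

        void add(String serializedRecord) {
            rows.add(serializedRecord);
            bufferedBytes += serializedRecord.length(); // rough size estimate
            if (rows.size() >= MAX_RECORDS || bufferedBytes >= MAX_BUFFER_BYTES) {
                flush();
            }
        }

        void flush() {
            // ... write the buffered rows out as one batch, then reset ...
            rows.clear();
            bufferedBytes = 0;
        }
    }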

@itaseskii
Contributor

@marcosmarxm Overall this connector seems solid, with some improvement points from my side. I think that we can proceed with merging.

Member

@marcosmarxm marcosmarxm left a comment

Thanks @Leibnizhu

@Leibnizhu
Contributor Author

@marcosmarxm Overall this connector seems solid, with some improvement points from my side. I think that we can proceed with merging.

Thanks for reviewing. I will keep improving these points in the next PR.

Leibnizhu added a commit to Leibnizhu/airbyte that referenced this pull request Nov 19, 2022
1. create temp table in startTracked()
2. extract iceberg spark sql operations to IcebergOperations, from IcebergConsumer and IcebergCatalogConfig
3. in check() method, create a temp table, write something, read it back, and finally drop the table.
@wahbiharibi

Trying to test the connector but getting the following error. Any idea?


2023-02-06 16:36:22 ERROR i.a.w.i.DefaultAirbyteStreamFactory(internalLog):163 - Exception attempting to access the Iceberg catalog:
airbyte-worker | Stack Trace: org.apache.iceberg.exceptions.ValidationException: Invalid S3 URI, cannot determine scheme: file:/user/hive/warehouse/temp_1675701381294/metadata/00000-2b6c90f7-2314-4a38-bf3b-bc33ffbdb64b.metadata.json
airbyte-worker | at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
airbyte-worker | at org.apache.iceberg.aws.s3.S3URI.<init>(S3URI.java:72)
airbyte-worker | at org.apache.iceberg.aws.s3.S3OutputFile.fromLocation(S3OutputFile.java:40)
airbyte-worker | at org.apache.iceberg.aws.s3.S3FileIO.newOutputFile(S3FileIO.java:132)
airbyte-worker | at org.apache.iceberg.BaseMetastoreTableOperations.writeNewMetadata(BaseMetastoreTableOperations.java:157)
airbyte-worker | at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:234)
airbyte-worker | at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:133)
airbyte-worker | at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:174)
airbyte-worker | at org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:75)
airbyte-worker | at org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:118)
airbyte-worker | at io.airbyte.integrations.destination.iceberg.config.catalog.IcebergCatalogConfig.check(IcebergCatalogConfig.java:47)
airbyte-worker | at io.airbyte.integrations.destination.iceberg.IcebergDestination.check(IcebergDestination.java:49)
airbyte-worker | at io.airbyte.integrations.base.IntegrationRunner.runInternal(IntegrationRunner.java:125)
airbyte-worker | at io.airbyte.integrations.base.IntegrationRunner.run(IntegrationRunner.java:100)
airbyte-worker | at io.airbyte.integrations.destination.iceberg.IcebergDestination.main(IcebergDestination.java:42)

@Leibnizhu
Contributor Author

Leibnizhu commented Feb 6, 2023

Trying to test the connector but getting the following error. Any idea?

@wahbiharibi is your Hive default database created as an Iceberg database?

Normally, the Hive default database is created by Hive, and its location is local or HDFS, so the default database cannot be used as an Iceberg database with S3 storage.

You could try another, nonexistent database name, and the Iceberg connector will create it with an S3 location for you.
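A hedged sketch of what that effectively amounts to in Spark SQL (the database name and bucket path are placeholders; spark is the connector's configured SparkSession):

    // Hedged sketch: the new database is created with an explicit S3 location,
    // so its tables land on S3 rather than the local/HDFS Hive warehouse.
    spark.sql("CREATE DATABASE IF NOT EXISTS airbyte_iceberg_db "
        + "LOCATION 's3a://my-bucket/warehouse/airbyte_iceberg_db'");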

For example, the default database and an Iceberg database as shown in the Hive metastore server's DBS table:

[screenshot: Hive metastore DBS table showing both databases]

@wahbiharibi

Thanks @Leibnizhu. It worked :)

@natalyjazzviolin
Contributor

Currently blocked by https://github.com/airbytehq/airbyte-internal-issues/issues/2695, waiting on a sandbox account.

@OmarSultan85

OmarSultan85 commented May 23, 2023

I am not sure if this is the right place to post, but I was wondering if there is a way to change the catalog name in the Spark config file. I noticed that it is using the name iceberg based on the Constants value of CATALOG_NAME. Is there any way to make this configurable or add it to the spec? I am referring here to the JDBC catalog.
