Destination Azure Blob Storage: connection time out due to looping over files in container. #16016

Open
@marcuslind90

Environment

  • Airbyte version: 0.40.0-alpha
  • OS Version / Instance: Azure Kubernetes Service
  • Deployment: Kubernetes Helm Chart
  • Source Connector and version: PostgreSQL
  • Destination Connector and version: Azure Blob Storage Alpha
  • Step where error happened: Check Connection to Destination Azure Blob Storage

Current Behavior

The connection check times out because it loops through every single blob in the container.

https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-azure-blob-storage/src/main/java/io/airbyte/integrations/destination/azure_blob_storage/AzureBlobStorageConnectionChecker.java#L93

  public void attemptWriteAndDelete() {
    initTestContainerAndBlob();
    writeUsingAppendBlock("Some test data");
    listBlobsInContainer()
        .forEach(
            blobItem -> LOGGER.info(
                "Blob name: " + blobItem.getName() + "Snapshot: " + blobItem.getSnapshot()));

    deleteBlob();
  }

The listBlobsInContainer() call attempts to list all blobs with prefix /, which means every blob in the container. If you have a large data lake, that means millions of blobs are iterated over and logged one by one.

Expected Behavior

Limit the number of blobs that the connection check attempts to list. You can pass options to the Azure SDK listBlobs() call to limit the results:

https://docs.microsoft.com/en-us/java/api/com.azure.storage.blob.models.listblobsoptions?view=azure-java-stable
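
A rough sketch of the idea, not a patch against the connector: cap how many items the check pulls from the listing before logging them. It uses only the JDK, with a fake lazy listing standing in for the real blob listing. The names BoundedListing, firstN, fakeListing and MAX_BLOBS_TO_LOG are made up for illustration, and the cap of 5 is an arbitrary assumption.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class BoundedListing {

  // Hypothetical cap; the connector would choose its own value.
  public static final int MAX_BLOBS_TO_LOG = 5;

  /** Returns at most maxItems elements and stops pulling from the source after that. */
  public static <T> List<T> firstN(Iterable<T> source, int maxItems) {
    return StreamSupport.stream(source.spliterator(), false)
        .limit(maxItems) // short-circuits, so the rest of the listing is never requested
        .collect(Collectors.toList());
  }

  /** Stand-in for a lazy blob listing; counts how many items are actually fetched. */
  public static Iterable<String> fakeListing(int total, AtomicInteger fetched) {
    return () -> new Iterator<String>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < total;
      }

      @Override
      public String next() {
        fetched.incrementAndGet();
        return "part-" + (next++) + ".parquet";
      }
    };
  }

  public static void main(String[] args) {
    AtomicInteger fetched = new AtomicInteger();
    // Simulate a container holding a million blobs.
    List<String> names = firstN(fakeListing(1_000_000, fetched), MAX_BLOBS_TO_LOG);
    names.forEach(name -> System.out.println("Blob name: " + name));
    System.out.println("Items fetched from listing: " + fetched.get());
  }
}
```

In the connector, the Iterable would presumably be whatever listBlobsInContainer() returns. As far as I understand the SDK, ListBlobsOptions.setMaxResultsPerPage() only bounds the size of each page, so iterating the full result would still walk every page. That is why the cap is applied to the iteration itself here. Combining both (a small page size plus a limit on the stream) should keep the check to a single small request.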

Logs

Log4j2Appender says: 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14048-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
2022-08-26 16:10:05 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14049-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
Log4j2Appender says: 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14049-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
2022-08-26 16:10:05 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14050-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
Log4j2Appender says: 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14050-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
2022-08-26 16:10:05 INFO i.a.w.i.DefaultAirbyteStreamFactory(lambda$create$0):61 - 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14051-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null
Log4j2Appender says: 2022-08-26 16:10:05 INFO i.a.i.d.a.AzureBlobStorageConnectionChecker(lambda$attemptWriteAndDelete$0):54 - Blob name: brad-test/parts/daily_store_inventory_part_pln_dated.parquet/part-14051-6ac91a9f-5924-4385-971e-43348051854c-c000.snappy.parquetSnapshot: null

Steps to Reproduce

  1. Set up an Azure Blob container and fill it with millions of files.
  2. Set up the Azure Blob Storage destination and test the connection.
  3. The call to the check_connection/ API endpoint stays pending indefinitely. You can view the log output on the worker to see that it is looping through every single file.
