Azure Blob Storage destination crashes due to lack of buffering #5980
Comments
@vholmer thanks for the heads up! Will get to it
Just wanted to comment here, as we had the same issue and patched a local copy of the destination connector to fix it. The problem appears to be that the connector uses getBlobOutputStream to obtain an output stream and writes to it directly. Azure recommends wrapping it in a BufferedOutputStream. Doing this appears to fix the issue and makes things quite a bit faster; we also set the PrintWriter autoflush to false, which may be contributing as well. I believe that because the stream is unbuffered and autoflushing, it commits too quickly, and when the Azure SDK tries to auto-detect the block size it ends up close to the smallest possible, depending on how large the source rows are. Azure Block Blob has a hard limit of 50,000 blocks per blob, which you cannot get around; increasing the block size is the only way to raise the ceiling. I didn't find a good way of setting the block size directly through the connector, but disabling autoflush and wrapping the stream in a BufferedOutputStream solved our issues. See the modified csv/AzureBlobStorageCsvWriter.java below.
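A minimal sketch of that wrapping, using only plain java.io (the actual patched csv/AzureBlobStorageCsvWriter.java is not reproduced in this thread; `blobOutputStream` below is a stand-in for whatever the Azure SDK's getBlobOutputStream returns):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class BufferedBlobWriteSketch {
    public static void main(String[] args) {
        // Stand-in for the stream from getBlobOutputStream; in the real
        // connector, each unbuffered write can become its own committed block.
        OutputStream blobOutputStream = new ByteArrayOutputStream();

        // The fix: wrap the stream in a BufferedOutputStream (5 MB here,
        // matching the default mentioned later in the thread) and disable
        // PrintWriter autoflush, so rows accumulate into large writes
        // instead of one tiny block per row.
        PrintWriter writer = new PrintWriter(
                new BufferedOutputStream(blobOutputStream, 5 * 1024 * 1024),
                /* autoFlush = */ false,
                StandardCharsets.UTF_8);

        writer.println("id,name");
        writer.println("1,example");
        writer.flush(); // flush once at the end, not after every row
        writer.close();
    }
}
```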
@bmatticus this is great, thanks for sharing! Would you be open to submitting a PR with your changes?
I have one in now, 9190 |
@vholmer the PR has been merged and version 0.1.1 of the destination is now available. I think it should fix your issue. The default setting is a 5MB buffer at the moment; if you have tables larger than about 195GB you may need to increase the buffer accordingly. Azure documents its blob block sizes: https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs. Block sizes go up to 4000MB, but the maximum buffer supported by the buffered writer is about 2048MB. There is some logic in the Azure append SDK that determines the block size; I've not dug into it deeply enough to understand it entirely.
I can't reproduce this issue. I believe it was fixed within the scope of #9190.
Environment
Current Behavior
The sync crashes after the file reaches 50,000 blocks because BlobStorageClient doesn't use any buffering of its own: each message is sent as its own block, which is far from optimal. We should implement some kind of buffering before sending a message to the blob.
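To see why buffering matters here, a self-contained demo with a hypothetical CountingOutputStream (plain java.io, no Azure SDK) that counts how many writes reach the underlying stream, treating each one as a committed block:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BlockCountDemo {
    // Hypothetical stand-in for the blob stream: each write call that
    // reaches it models one committed block.
    static class CountingOutputStream extends OutputStream {
        int writes = 0;
        @Override public void write(int b) { writes++; }
        @Override public void write(byte[] b, int off, int len) { writes++; }
    }

    public static void main(String[] args) throws IOException {
        byte[] row = "some,csv,row\n".getBytes();

        // Unbuffered: one write (one "block") per row -> hits the
        // 50,000-block hard limit after 50,000 rows.
        CountingOutputStream unbuffered = new CountingOutputStream();
        for (int i = 0; i < 50_000; i++) unbuffered.write(row, 0, row.length);

        // Buffered with 5 MB: the same 50,000 rows (~650 KB total) fit in
        // the buffer and reach the stream as a single large write.
        CountingOutputStream raw = new CountingOutputStream();
        OutputStream buffered = new BufferedOutputStream(raw, 5 * 1024 * 1024);
        for (int i = 0; i < 50_000; i++) buffered.write(row, 0, row.length);
        buffered.flush();

        System.out.println(unbuffered.writes + " vs " + raw.writes); // prints 50000 vs 1
    }
}
```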
Expected Behavior
The sync shouldn't crash; the file should keep growing without the committed block count approaching the limit.
Other details
The file was automatically created and filled by Airbyte and contained 50,000 lines and 50,000 committed blocks.
Logs
Steps to Reproduce
Are you willing to submit a PR?
No.