-
Notifications
You must be signed in to change notification settings - Fork 197
FIX: Add retry mechanism with exponential backoff for buildkit bootstrap step #424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
FIX: Add retry mechanism with exponential backoff for buildkit bootstrap step #424
Conversation
Signed-off-by: Daniel Amar <[email protected]>
Signed-off-by: Daniel Amar <[email protected]>
77f0c14
to
66d7daa
Compare
@crazy-max thoughts here? We used the fix here and noticed a considerable reduction in time waiting for this action, when it displays this intermittent race condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contrib!
Left some comments and suggestions.
I'm also wondering if we should not have this built-in in Buildx. Will discuss with the team about this.
// Create a promise that will timeout after 15 seconds | ||
const timeoutPromise = new Promise<never>((_, reject) => { | ||
setTimeout(() => { | ||
reject(new Error('Timeout exceeded while waiting for buildkit to initialize')); | ||
}, 15000); // 15 second timeout | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this timeout promise, inspect
command already have some timeout: https://github.com/docker/buildx/blob/ea2b7020a4645bff395eb49e4e87ef08ba24eb93/commands/inspect.go#L38-L40
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout is the core of the change however, as we found sometimes this command execution can hang during high resource contention, in particular IO bottlenecks, in our infrastructure.
This approach is a classic example of the "end-to-end principle" in distributed systems - reliability mechanisms at lower layers can't always account for failures at higher layers, so an additional timeout at the highest level provides more comprehensive protection against various failure modes across the entire stack.
@crazy-max implemented the change to retry for any error instead of just buildx errors, as well as reduced the max retry attempts. Your final request warrants further discussion. |
This should fix: #283
There is a race condition in the buildx setup process:
The fix here is to continue retrying until the daemon is fully initialized, and timing the bootstrap out after 15 seconds to attempt a retry, instead of waiting many minutes.
Try and replace
docker/setup-buildx-action@v3
with:danielamar101-pton/setup-buildx-action@fix-buildx-bootstrap-race-condition
to use my version, and notice a reduction in waiting for this action.