Skip to content

FIX: Add retry mechanism with exponential backoff for buildkit bootstrap step #424

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

danielamar101-pton
Copy link

@danielamar101-pton danielamar101-pton commented May 5, 2025

This should fix: #283

There is a race condition in the buildx setup process:

  • When the action runs docker buildx inspect --bootstrap --builder , it's trying to initialize the buildkit daemon.
  • The error occurs because the action is trying to communicate with buildkit via its socket file (/run/buildkit/buildkitd.sock) before the daemon is fully initialized.

The fix here is to continue retrying until the daemon is fully initialized, and timing the bootstrap out after 15 seconds to attempt a retry, instead of waiting many minutes.

Try and replace docker/setup-buildx-action@v3 with: danielamar101-pton/setup-buildx-action@fix-buildx-bootstrap-race-condition to use my version, and notice a reduction in waiting for this action.

@danielamar101-pton danielamar101-pton force-pushed the fix-buildx-bootstrap-race-condition branch from 77f0c14 to 66d7daa Compare May 5, 2025 21:09
@danielamar101-pton
Copy link
Author

danielamar101-pton commented May 6, 2025

@crazy-max thoughts here? We used the fix here and noticed a considerable reduction in time waiting for this action, when it displays this intermittent race condition.

Copy link
Member

@crazy-max crazy-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contrib!

Left some comments and suggestions.

I'm also wondering if we should not have this built-in in Buildx. Will discuss with the team about this.

Comment on lines +212 to +217
// Create a promise that will timeout after 15 seconds
const timeoutPromise = new Promise<never>((_, reject) => {
setTimeout(() => {
reject(new Error('Timeout exceeded while waiting for buildkit to initialize'));
}, 15000); // 15 second timeout
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this timeout promise, inspect command already have some timeout: https://github.com/docker/buildx/blob/ea2b7020a4645bff395eb49e4e87ef08ba24eb93/commands/inspect.go#L38-L40

Copy link
Author

@danielamar101-pton danielamar101-pton May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout is the core of the change however, as we found sometimes this command execution can hang during high resource contention, in particular IO bottlenecks, in our infrastructure.

This approach is a classic example of the "end-to-end principle" in distributed systems - reliability mechanisms at lower layers can't always account for failures at higher layers, so an additional timeout at the highest level provides more comprehensive protection against various failure modes across the entire stack.

@danielamar101-pton
Copy link
Author

@crazy-max implemented the change to retry for any error instead of just buildx errors, as well as reduced the max retry attempts. Your final request warrants further discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Occasional slow startup times during Booting builder step
2 participants