feat: Run monitoring follower with acceptance testing #10816
Conversation
Deploying agoric-sdk with Cloudflare Pages

Latest commit: b3be5a1
Status: ✅ Deploy successful!
Preview URL: https://01583980.agoric-sdk.pages.dev
Branch Preview URL: https://usman-acceptance-pre-test.agoric-sdk.pages.dev
This looks great. Besides a couple of requested optimizations, I'd also like @michaelfig to take a look, as he's more familiar with shell scripting than I am and has some historical interest in the integration test.
.github/workflows/integration.yml
Outdated
```yaml
- id: build-cosmic-swingset
  name: Build cosmic-swingset dependencies
  run: |
    set -o errexit

    yarn install
    make --directory packages/cosmic-swingset all
  working-directory: agoric-sdk
```
Let's move this before the `verify SDK image didn't change` step, as a build of the cosmos side of the SDK should not cause the docker image to rebuild either, and if it does, it's a bug in the `.dockerignore` file / `Dockerfile`.
Done
```sh
if ! test -z "$MESSAGE_FILE_PATH"; then
  echo "Waiting for 'ready' message from follower"
  # make sure the follower has not crashed
  node "$DIRECTORY_PATH/wait-for-follower.mjs" '^(ready)|(exit code \d+)$' | grep --extended-regexp --silent "^ready$"
  echo "Follower is ready"
fi
```
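For context, a rough shell-only sketch of the contract this check relies on — hypothetical, assuming the follower appends status lines such as `ready` or `exit code N` to the file at `$MESSAGE_FILE_PATH` — could look like:

```sh
# Hypothetical sketch, not the actual helper: wait-for-follower.mjs
# implements this polling logic in Node. Loop until the follower's
# message file reports either readiness or a crash.
while true; do
  if grep --extended-regexp --quiet '^ready$' "$MESSAGE_FILE_PATH"; then
    echo "Follower is ready"
    break
  fi
  if grep --extended-regexp --quiet '^exit code [0-9]+$' "$MESSAGE_FILE_PATH"; then
    echo "Follower crashed" >&2
    exit 1
  fi
  sleep 5
done
```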
Given that we're spending a significant amount of time waiting for the follower to be ready, I am wondering if we shouldn't move this test lower in the suite. There isn't really a drawback, as the follower will still execute all blocks the same, just some of them as catch-up instead.
Currently it takes ~250 empty blocks for the follower to be ready, and the beginning of "ACCEPTANCE TESTING wallet" comes about 300 blocks after that, so I'd move the ready check at least there or later.
Done
Thanks. I can confirm that it now runs in parallel and the test suite didn't block on the follower being ready. It looks like it didn't reduce the overall test time, but that might be due to the runner being slow in general (which would likely be alleviated by #11003)
Awesome, thanks for working through this.
One final nit, but feel free to merge after that. Would still love to get @michaelfig's eyes on it, but we can always revisit.
.github/workflows/integration.yml
Outdated
```sh
yarn install
make --directory packages/cosmic-swingset all
```
I didn't pay close attention to the commands here; where do these come from? I'd expect the following:

```diff
-yarn install
-make --directory packages/cosmic-swingset all
+cd packages/cosmic-swingset
+make install
```

In particular, there should be no need to redo `yarn install`, as that was done as part of the `restore-node` step. Given caches these are mostly equivalent; I'm just surprised to see something different from `deployment-test`.
Fixed
Consider cribbing from an existing `wait_for_rpc` check instead of implementing it from scratch.
curl "$rpc_address" --max-time "5" --silent > /dev/null 2>&1 | ||
status_code="$?" | ||
|
||
echo "rpc '$rpc_address' responded with '$status_code'" | ||
|
||
if ! test "$status_code" -eq "0"; then | ||
sleep 5 | ||
else | ||
break | ||
fi |
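For comparison, here is a hypothetical shape such a reusable helper could take — a sketch only, not the existing `wait_for_rpc` referenced above:

```sh
# Hypothetical wrapper: retry until the RPC endpoint answers at all.
# Note the thread below: answering is a weaker condition than being caught up.
wait_for_rpc() {
  rpc_address="$1"
  while ! curl "$rpc_address" --max-time 5 --silent > /dev/null 2>&1; do
    echo "rpc '$rpc_address' not responding yet, retrying"
    sleep 5
  done
}
```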
This `wait_for_rpc` will return success even if the node has not caught up with the rest of the chain. Instead, take inspiration from instagoric's `readinessProbe`:

```sh
json=$(curl "$rpc_address/status" --max-time 5 --silent 2>/dev/null | jq .result.sync_info.catching_up)
echo "rpc '$rpc_address' responded with 'catching_up=$json'"
test "$json" != false || break
sleep 5
```
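Folded into a complete polling loop, that suggestion would look something like this sketch (assuming `$rpc_address` is already set and `jq` is available):

```sh
# Hypothetical complete loop built around the readinessProbe suggestion above:
# keep polling until the node reports catching_up=false.
while true; do
  json=$(curl "$rpc_address/status" --max-time 5 --silent 2>/dev/null | jq .result.sync_info.catching_up)
  echo "rpc '$rpc_address' responded with 'catching_up=$json'"
  test "$json" != false || break
  sleep 5
done
```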
> This `wait_for_rpc` will return success even if the node has not caught up with the rest of the chain.

Well, this is a single-node chain, and we're using the validator as the RPC endpoint in this case, so it is sufficient.
Maybe I put that incorrectly. There is a situation where, when the chain is booting for the first time, the RPC port becomes available but queries fail until the first block is committed. I don't know if the rest of this feature is tolerant of that situation.
Maybe this is a case where we can rely on already having committed blocks available in state (from the history of the a3p-integration upgrades). I'm wary because I've struggled with flakes in other tests where the "wait-for-chain" function didn't actually wait for both the RPC to start listening and genesis to be committed.
I'm just concerned that this function might get cargo-culted without a caveat that it is brittle.
Gotcha. Yeah we know for a fact the chain has already bootstrapped in this case. It might be good to add a disclaimer about this assumption.
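Such a disclaimer could be paired with a stricter probe. As a purely hypothetical sketch (assuming the standard CometBFT `/status` response, which exposes `sync_info.latest_block_height`), the check could also require that at least one block has been committed:

```sh
# Hypothetical: wait until the RPC answers AND at least one block is committed,
# covering the boot window where the port is open but queries still fail.
while true; do
  height=$(curl "$rpc_address/status" --max-time 5 --silent 2>/dev/null \
    | jq -r '.result.sync_info.latest_block_height // 0')
  if test "$height" -gt 0 2>/dev/null; then
    break
  fi
  sleep 5
done
```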
closes: #XXXX
refs: #XXXX
Description
This PR runs a loadgen follower in the acceptance proposal pre-test hook and monitors the chain while the acceptance proposal is executing.
The generated artifacts are also uploaded to GitHub.
Security Considerations
None
Scaling Considerations
None
Documentation Considerations
None
Testing Considerations
None
Upgrade Considerations
None