Skip to content

feat: Run monitoring follower with acceptance testing #10816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 84 commits into from
Mar 5, 2025

Conversation

usmanmani1122
Copy link
Contributor

@usmanmani1122 usmanmani1122 commented Jan 8, 2025

closes: #XXXX
refs: #XXXX

Description

This PR runs a loadgen follower in the accpetance proposal pre test hook and monitor the chain while the acceptance proposal is executing
The generated artifacts are also uploaded to Github

Security Considerations

None

Scaling Considerations

None

Documentation Considerations

None

Testing Considerations

None

Upgrade Considerations

None

@usmanmani1122 usmanmani1122 self-assigned this Jan 8, 2025
Copy link

cloudflare-workers-and-pages bot commented Jan 8, 2025

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: b3be5a1
Status: ✅  Deploy successful!
Preview URL: https://01583980.agoric-sdk.pages.dev
Branch Preview URL: https://usman-acceptance-pre-test.agoric-sdk.pages.dev

View logs

@usmanmani1122 usmanmani1122 added the force:integration Force integration tests to run on PR label Jan 22, 2025
@usmanmani1122 usmanmani1122 requested a review from mhofman March 3, 2025 08:56
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Besides a couple requested optimization, I'd also like for @michaelfig to take a look as he's more familiar with shell scripting than I am, and has some historical interest in the integration test.

Comment on lines 304 to 311
- id: build-cosmic-swingset
name: Build cosmic-swingset dependencies
run: |
set -o errexit

yarn install
make --directory packages/cosmic-swingset all
working-directory: agoric-sdk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this before the verify SDK image didn't change step, as a build of the cosmos side of the SDK should not cause the docker image to rebuild either, and if it does it's a bug in the dockerignore file / Dockerfile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 6 to 11
if ! test -z "$MESSAGE_FILE_PATH"; then
echo "Waiting for 'ready' message from follower"
# make sure the follower has not crashed
node "$DIRECTORY_PATH/wait-for-follower.mjs" '^(ready)|(exit code \d+)$' | grep --extended-regexp --silent "^ready$"
echo "Follower is ready"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're spending a significant amount of time waiting for the follower to be ready, I am wondering if we shouldn't move this test lower in the suite. There isn't really a drawback as the follower will still execute all blocks the same, just some of them as catchup instead.

Currently it take ~250 empty blocks for the follower to be ready, and about 300 blocks after the follower is ready is the beginning of "ACCEPTANCE TESTING wallet", so I'd move the ready test at least there or later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I can confirm that it now runs in parallel and the test suite didn't block on the follower being ready. It looks like it didn't reduce the overall test time, but that might be due to the runner being slow in general (which would likely be alleviated by #11003)

@mhofman mhofman requested a review from michaelfig March 4, 2025 04:01
@usmanmani1122 usmanmani1122 requested a review from mhofman March 4, 2025 11:37
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for working through this.

One final nit, but feel free to merge after that. Would still love to get @michaelfig's eyes on it, but we can always revisit.

Comment on lines 6 to 11
if ! test -z "$MESSAGE_FILE_PATH"; then
echo "Waiting for 'ready' message from follower"
# make sure the follower has not crashed
node "$DIRECTORY_PATH/wait-for-follower.mjs" '^(ready)|(exit code \d+)$' | grep --extended-regexp --silent "^ready$"
echo "Follower is ready"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I can confirm that it now runs in parallel and the test suite didn't block on the follower being ready. It looks like it didn't reduce the overall test time, but that might be due to the runner being slow in general (which would likely be alleviated by #11003)

Comment on lines 298 to 299
yarn install
make --directory packages/cosmic-swingset all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't pay close attention to the commands here, where do these come from?
I'd expect the following:

Suggested change
yarn install
make --directory packages/cosmic-swingset all
cd packages/cosmic-swingset
make install

In particular there should be no need to redo yarn install as that was done as part of the restore-node step. Given caches these are mostly equivalent, just surprised to see something different than deployment-test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@usmanmani1122 usmanmani1122 added the automerge:squash Automatically squash merge label Mar 5, 2025
@mergify mergify bot merged commit baa7c08 into master Mar 5, 2025
94 checks passed
@mergify mergify bot deleted the usman/acceptance-pre-test branch March 5, 2025 10:01
Copy link
Member

@michaelfig michaelfig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider cribbing from an existing wait_for_rpc check instead of implementing it from scratch.

Comment on lines +60 to +69
curl "$rpc_address" --max-time "5" --silent > /dev/null 2>&1
status_code="$?"

echo "rpc '$rpc_address' responded with '$status_code'"

if ! test "$status_code" -eq "0"; then
sleep 5
else
break
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wait_for_rpc will return success even if the node has not caught up with the rest of the chain. Instead, take inspiration from instagoric's readinessProbe

    json=$(curl "$rpc_address/status" --max-time 5 --silent 2>/dev/null | jq .result.sync_info.catching_up)

    echo "rpc '$rpc_address' responded with 'catching_up=$json'"

    test "$json" != false || break
    sleep 5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wait_for_rpc will return success even if the node has not caught up with the rest of the chain.

Well this is a single node chain, and we're using the validator as RPC in this case, so it is sufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I put that incorrectly. There is a situation where when the chain is booting for the first time, the RPC port becomes available, but it will fail queries until the first block is committed. I don't know if the rest of this feature is tolerant of that situation.

Maybe this is a case where we can rely on already having committed blocks available in state (from the history of the a3p-integration upgrades). I'm wary because I've struggled with flakes in other tests where the "wait-for-chain" function didn't actually wait for both RPC to start listening as well as genesis to be committed.

I'm just concerned that this function might get cargo-culted without a caveat that it is brittle.

Copy link
Member

@mhofman mhofman Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Yeah we know for a fact the chain has already bootstrapped in this case. It might be good to add a disclaimer about this assumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:squash Automatically squash merge force:integration Force integration tests to run on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants